Thanks, Claude. And Martin, whenever you're ready. Let's do this. So today I'm going to present Semgrep, the polyglot static analysis tool that we've been developing at r2c. Here I am. So, yeah, I'm Martin, and my language of choice has been OCaml for 20 years now. I started working on protein structure originally, and then for the last 10 years I've been working in startups in the Bay Area, so nothing to do with biology anymore. But yeah, that's my background. Our company, r2c, is based in San Francisco and focused on security. So even though I'm going to spend time on some internals that are more on the expert side, the goal of the tool is really to make something useful: to prevent and catch bugs as early as possible, and to help developers not commit bad code by accident. That's the security aspect. So what is Semgrep? r2c started developing Semgrep about a year ago, but it has a longer history: it descends from work at Facebook, and even before that from Coccinelle, a tool developed in part by one of my colleagues, Yoann. He brought that expertise to Facebook, where he developed parsers for PHP, because that's what Facebook uses, and then added more languages. Eventually he joined r2c, and I joined the Semgrep project last summer. By then it was already getting started; it had traction, and people liked it. I'm going to present a bit of how it works and why it's appealing. So in short, Semgrep is a lightweight static analysis tool. It's really just like grep, but a grep that understands source code. You write your patterns as code, almost pseudocode, using the familiar syntax of the target language, plus a few extra constructs for expressing patterns. It's not very hard. It's relatively fast, and it works for a variety of languages.
That's going to be the focus of my talk today: how we deal with all those languages. You can see the list here; some are supported completely, others are still in progress. I'll present the tech behind that. But first let me show you a bit of Semgrep. So, there's grep and there are trees. What you can do with grep, well, it starts easy. If you look at the first exec call, you can catch that with grep; you have to put a backslash in front of your parenthesis, but other than that it's going to work. Now the third case: oh, there's a space before the parenthesis. Okay, we modify our pattern and add a backslash-s for whitespace. And then, oh, multi-line patterns: that's really tricky with grep; I don't even know how to deal with that. And here a function is not really called exec, it's something-underscore-exec, and our grep pattern would match that, but we don't want it to. And comments: grep is not going to understand comments. Same for string literals. All those things get really hard to deal with. So either you're too restrictive, or you have very complex patterns that are brittle, or you catch too many things and get false positives. That's because grep works on strings, while programs are really meant to be understood as trees. And trees are the way we choose; that's the correct way. There are many tools that do tree matching; there's a short list here. The thing is, they're all specialized to one specific language. They'll be good for certain types of linting, this or that, but they're not very generic, and you have to set up and learn a new tool for each language. That's difficult. Imagine you're in charge of security at your company and you need to review and audit all the code.
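To make those failure modes concrete, here's a small sketch (the file contents and commands are invented for illustration) of how a grep pattern for exec calls gets brittle:

```shell
# Hypothetical target file illustrating the cases above.
cat > target.py <<'EOF'
exec(code)
exec (code)
safe_exec(code)
# exec(code)
EOF

# Escape the parenthesis and allow optional whitespace before it:
grep -nE 'exec[[:space:]]*\(' target.py
# This matches all four lines: the two real calls, but also
# safe_exec (a different function) and the commented-out call.
```

And a call split across several lines would be missed entirely, since grep matches one line at a time.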
You may want to catch certain patterns, certain bad usages, and you have to deal with all the languages in use. Sometimes it's configuration files written in a special syntax. For all of these, we want a way to find patterns, and there isn't one tool for everything. We're trying to provide that. All right. So this is a chart of the two extremes, and we sit in the middle. Regex-based approaches are fast but limited and brittle. At the other end of the spectrum, we have complex tools that can be slow and hard to set up. We're trying to find the sweet spot in between, where you can do the simple things easily and also get some more advanced stuff done. So let me show you basic Semgrep usage. There's not much more than this; the essential constructs are the ellipsis and metavariables. Let me start with the ellipsis. We have a live editor, so let's go there. This is semgrep.dev; that's the playground you can use to explore Semgrep and create your own rules. Here's a little exercise. This box, the one with a TODO, is where we're going to put our pattern. The examples on the right show the kind of matches we want. This example is for Python; we could pick other languages. And here is a target program with a bunch of calls to exec, and we'd like to catch the right ones: this one, obviously, on line four, and line six. This one has a space; we want to catch that. A multi-line one. This one we don't want to catch. The comment shouldn't surface. And this is a quoted string, not a call to a function. So the last three shouldn't match.
Let's see if we can do something like that. So this is our first Semgrep pattern, which we enter here, with an ellipsis for a sequence of things. Let's see if it works. All right, this is nice: we got all the calls that we wanted, even the first one, which is actually an alias. This other function is an alias for exec, and it was caught correctly. This shows that we try to do a bit more than just syntax matching, and that's what the "sem" in Semgrep stands for: semantics. We try to do a semantic grep, and we keep adding features to make it more and more powerful. Oops, here we go. The other special pattern construct is the metavariable. A metavariable is like a capture: we can capture things that match and give them a name. So here, for this code, we're going to try to find places where something is compared to the same thing: something == the same thing, whatever that is. If you have cat == cat here, we want to catch that; we don't want to catch seven == eight. It's a simple example, which may or may not be useful in practice, but generally you don't want to compare two identical things. Anyway, let's try. The dollar notation is special Semgrep syntax, so we can write $X. If we write $X and $Y, that means anything compared to anything: X is anything, Y is anything, and this catches everywhere the equality operator is applied. It catches everything. If we want to catch something compared to the same thing, all we have to do is use the same variable on both sides. And tada, it works. If you look here, we have the exact same expression on the left and on the right; presumably that's a programmer error. And we can make patterns more complicated than this, obviously.
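The two constructs from the demo can be summarized like this (Semgrep pattern syntax; the example targets are invented stand-ins for the playground snippets):

```
# Ellipsis: match a call to exec with any arguments,
# across whitespace and line breaks:
exec(...)

# Metavariables: $X captures an expression; reusing the same
# metavariable forces both sides to be identical:
$X == $X        # matches  cat == cat   but not  7 == 8
```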
So that's basic Semgrep usage. We offer other things too; one interesting one I'll mention here is autofix. In this example we've highlighted a call that launches an HTTP server on port 80, but it uses the vanilla function, which is insecure because it doesn't use TLS. We want to find this and replace it with the function that uses TLS. In this view we have the YAML config for the rule. It has a bit of fluff, which is actually useful: it explains the background, so when the flaw is detected in the code, the user knows what's going on. But essentially there's a pattern that we want to write, and a fix for that pattern. So what is our pattern going to be? It's going to be http-dot-something. Do I care what's in here? Oh yes, I do, because I'm going to reuse it: the first argument is the port, and this other thing, I don't know what it is, so I'll call it $X. We want to replace the call with the same function plus a TLS suffix. So my fix is going to be this, and I hope it works. Okay, we have a match, that's good, and a suggestion. Let's see if we can apply the fix... and the fix was applied. That's an experimental feature in Semgrep, I think, but we want it to work well eventually, because it's just nice. There's also all the work of integrating Semgrep in CI, which is very significant, but I'm not really working on that; I'm going to present more of the internals of parsing. Anyway, you get a sense of the sort of things we can do with Semgrep. That was the warm-up. So we saw that we want to deal with various languages. Let me back up a bit and show you the languages we have; they're listed here. All of those are normal programming languages.
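A rule with a fix looks roughly like this in the YAML format shown in the playground (a sketch: the rule id, function names, and message are invented for illustration; only `pattern` and `fix` reuse the metavariable mechanics from the demo):

```yaml
rules:
  - id: insecure-http-server
    languages: [python]
    severity: WARNING
    message: >
      This starts an HTTP server without TLS.
      Use the TLS-enabled variant instead.
    pattern: http.serve($PORT, $X)
    fix: http.serve_tls($PORT, $X)
```

The metavariables captured by `pattern` ($PORT, $X) are substituted back into the `fix` template when the suggestion is applied.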
We also have JSON, which is just a subset of JavaScript, and YAML, which is its own weird thing, but the other languages are handled in a similar way. And then there's generic pattern matching, which is something else; I'll talk about it later. It's for dealing with unknown languages. All the other languages are dealt with in the same fashion, using the Semgrep patterns we just saw. For that, we use a generic AST. The pattern, as we saw, looks like pseudocode, and it goes through the same parser as the programs we want to match against. It's just that the grammar is extended with the ellipsis and metavariables; otherwise it's the same parser, which is interesting. Both the pattern and the program result in an AST, which we call the generic AST. Regardless of the language we start with (in these examples, Python, Ruby, and JavaScript), we use one technique or another to get there, but they all converge to this generic AST. It's a uniform representation that accommodates all the languages, and that's quite nice. So here we have three parsing flows, but the main one, the one we now prefer, uses tree-sitter; I'll explain what tree-sitter is in a bit. It goes in two steps. In the first step, a whole machinery gives us a CST, a concrete syntax tree, also known as a parse tree, which has all the details: all the semicolons, parentheses, and so on. An AST strips those off and is nicer to work with. So from tree-sitter we get a CST, and then the CST is mapped into the generic AST. I'll show a bit of that. We also have legacy parsers, the ones that came with pfff, which Yoann developed at Facebook. Python, for example, uses such a parser. It's written using Menhir, a parser generator for OCaml. There's nothing really special about it; it works well, but these parsers need to be maintained.
We have another flow that's more complicated for historical reasons, based partly on tree-sitter and partly on the old stuff. I'm mostly going to show you the tree-sitter flow today. Anyway, once we have a generic AST for our pattern and for our target program, we can compare them, and this is what our matcher does. We have a single piece of code, a pretty consequential one, that takes care of matching a pattern against a target program, and it all works on the generic AST. Here's the proof: earlier I ran our parsers on the Python program on the left and the JavaScript program on the right. They're really simple, and you can see the dump of the generic AST. If you look closely, you'll see that they're the same: we have a function definition in both cases, and everything is very similar. There's no clue in these ASTs that we're dealing with Python or JavaScript. On the right we have console.log instead of print, so that makes a difference in the tree, but otherwise this is our generic AST. As you can imagine, since we accommodate all the languages and each language has its own set of features, we have many kinds of nodes in our tree. This is a snapshot; I could show you more details, but it's not that interesting. It's about 2,000 lines of type definitions. For those interested, we've started an effort to export the AST in a JSON format that will be documented and usable from other languages; you can ask me about that if you're interested. We think it could be useful for others who'd like to experiment with programs across multiple languages, not just one. This huge AST needs to be traversed for several different operations. One of the big ones is matching, but we also do constant propagation, optimizations, and so on.
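To give a feel for the shape of such a tree, here's a toy sketch in OCaml. This is not the actual Semgrep definition (which is around 2,000 lines); all constructor names here are invented:

```ocaml
(* Toy sketch of a language-agnostic AST: one set of constructors
   that Python, JavaScript, Ruby, etc. all map onto. *)
type expr =
  | Id of string               (* identifiers: print, console, ... *)
  | Lit of string              (* literals *)
  | Call of expr * expr list   (* function calls *)
  | DotAccess of expr * string (* e.g. console.log *)
  | Ellipsis                   (* the '...' pattern construct *)
  | Metavar of string          (* the '$X' pattern construct *)

type stmt =
  | ExprStmt of expr
  | FuncDef of string * string list * stmt list

(* Python's  print("hi")  and JavaScript's  console.log("hi")
   both become an ExprStmt of a Call; only the callee differs. *)
let py_hello = ExprStmt (Call (Id "print", [Lit "hi"]))
let js_hello = ExprStmt (Call (DotAccess (Id "console", "log"), [Lit "hi"]))
```

Note that the pattern constructs (Ellipsis, Metavar) live in the same type as ordinary program nodes, which is what lets one matcher compare patterns against programs.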
And we can't really write a thousand lines of pattern matching every time we just want to scan the tree for something. So here's a little example. I took this snippet from an optimization we do that needs to visit statement nodes and identifier nodes, but doesn't care about the rest. If you look at the kident field here, that's a field in an OCaml record. If you're familiar with the visitor pattern, great; otherwise, I'll explain it. There is a generic function, written once and for all, that visits all the nodes of all kinds in the tree. When it finds a node of a given type, in this case an identifier, it calls the function we provided, which is in charge of doing something with that node; that's the little anonymous function we have here, which does something with this ID. The next one, kstmt, gets called when we visit a statement, and we do something with the statement there too. So this machinery works. It's the style typically used in languages other than OCaml that don't have pattern matching or algebraic data types. In OCaml we do have those, so this feels a bit heavy, but it works. If you just need to visit identifiers in the AST, for example, it gets the job done really fast, instead of writing a thousand lines of recursive functions covering all the possible cases. So that's how it works. [Question from the audience:] "Your generic AST, does it carry enough information to do translation between languages, or is it short of that?" So, you saw the autofix example I showed. We do indeed have some way of printing back into the concrete syntax of a specific language. I haven't worked on that myself, so I don't know how well it works, or whether it works for all languages.
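The visitor style described above can be sketched like this on an invented mini-AST (the type and field names, including kident and kstmt, are simplified stand-ins, not the real Semgrep visitor):

```ocaml
(* A generic traversal is written once; callers supply callbacks
   only for the node kinds they care about. *)
type expr = Id of string | Call of expr * expr list
type stmt = ExprStmt of expr | Block of stmt list

type visitor = {
  kident : string -> unit;  (* invoked on every identifier *)
  kstmt : stmt -> unit;     (* invoked on every statement *)
}

let rec visit_expr v = function
  | Id name -> v.kident name
  | Call (f, args) -> visit_expr v f; List.iter (visit_expr v) args

let rec visit_stmt v st =
  v.kstmt st;
  (match st with
   | ExprStmt e -> visit_expr v e
   | Block stmts -> List.iter (visit_stmt v) stmts)

(* Collect all identifiers without hand-writing a full traversal. *)
let idents st =
  let acc = ref [] in
  visit_stmt { kident = (fun s -> acc := s :: !acc);
               kstmt = (fun _ -> ()) } st;
  List.rev !acc
```

For example, `idents (ExprStmt (Call (Id "f", [Id "x"])))` returns `["f"; "x"]`: only the two callbacks were written by hand, and the generic walk did the rest.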
But yes, in that sense there would be a way to go from concrete syntax to the generic AST, and then pretty-print that generic snippet into a chosen language, which would amount to a translation. We'd have to look at the quality of that transformation, but at least some of it is supported. So once we have this generic AST, we're good to go; that's my message here. There are a lot of possibilities. We don't need to worry about language specifics... well, there are still specific features of some languages present in this tree, but we can choose not to deal with them depending on what we're trying to do. If we just want to find function calls, we don't need to know about every feature of every language. [Audience:] "I would imagine trying to make something that also understands, say, the Rust borrow checker would be hard." Yeah, so we're going to have some Rust-specific nodes in our tree; we have that for many languages. Sometimes there are node kinds specific to one language, like OCaml, for example: it's functional, so things are organized a bit differently. It works in the end. It's nice when we get to the generic AST; that's my message. Getting there, though, is the interesting part, so now I'm going to talk about the various parsing technologies I mentioned a few slides earlier. The two main parser generators we use: first, Menhir, which is an LR(1) parser generator for OCaml. LR(1) is a type of grammar, if you know or remember this; the 1 means the parser chooses which branch to take by looking ahead one token. Most of the time that's good enough, but not all programming languages are designed to work that way, so sometimes it's harder. And actually, the other tool, tree-sitter, lets us handle those cases; I'll explain more. So we use those two.
We migrated to tree-sitter last summer, and that's the big interesting thing. We also use handwritten OCaml parsers when they're available. For YAML, which we added recently, we just used a YAML library for OCaml, and that's great; YAML is not a very simple language, actually, so we're happy to use a library. And I'm also going to talk about something I developed, a catch-all fallback called spacegrep. It's a step below Semgrep, but it's in the spirit of Semgrep; it sits between grep and Semgrep, hence the shared grep suffix. All right, let me tell you a bit about tree-sitter. Tree-sitter is a big project created by Max Brunsfeld, who, interestingly, works at GitHub. GitHub manages the project now, and it's open source, so everyone benefits. It was designed specifically for editors, so it supports incremental parsing; that's not something we use for our purposes, but okay. What it is, is a parser generator. It generates C code, which is nice for us: from OCaml, we can call this C code and get the parse tree. The great thing is that there are grammars for it. Many were developed by Max, but now there's a bunch of community members, including ourselves, contributing to them. So there are grammars for many languages, something like 30 of the most popular ones, and the number is slowly increasing. So why are we using tree-sitter? Well, it was a very rational, well-considered choice. The problem is that maintaining grammars in OCaml for many languages is a lot of work, not because of OCaml, but because we'd be on our own. Tree-sitter, instead, has many contributors. The community contributes to the grammars, we contribute to the grammars, and everyone benefits from each other's work. It's a lot more productive, because writing grammars is indeed time-intensive, and not always obvious.
That said, tree-sitter also has some unique features that we take advantage of. One of them is GLR parsing; GLR stands for generalized LR. If you've worked with parser generators like yacc, sometimes you get static conflicts: at the time of processing the grammar, yacc tells you, oh, you have a conflict between this and this, I don't know which to choose, and you have to specify precedences. If you're able to do that and determine statically which rule or branch should be picked, great. But sometimes it's not possible, and it can be very tricky. So instead of doing some weird preprocessing work, there is dynamic conflict resolution, in which the parser tries all the possible branches. If one branch works and the others fail, great, we take the one that works. If several succeed, there's a system of scores, and the one with the highest score is picked. It's a very nice way of getting out of difficult parsing situations. And it's opt-in: it has to be specified in the grammar, which is also nice. It's not the default, because it's a little slow and you try to avoid it, but it's an option we have for resolving difficult problems. That's really nice. The other feature is error recovery. When the parser finds a region of the program that cannot be parsed, that region can be skipped over. We get a CST with an error node somewhere; we can happily ignore it, and the rest of the tree is still valid and well-formed after removing this error node. That's really cool. In a text editor it's a benefit, because you want most of your program to be syntax-highlighted; what you're typing right now isn't highlighted, but that's okay, you know it's broken. For Semgrep it's great because sometimes some new syntax is not supported by our parser.
Or there's a programming error in the file. But most of the time it's that we don't support a specific syntax feature, maybe because it's too recent, and we're usually still able to parse most of the file and ignore the line or few lines that have the error. That's really nice: this way Semgrep can search most of the code, 99.99% of it, if we have a good grammar and just one line here and there fails. So those are the good things about tree-sitter. Now, I don't know if I should really go into the details, but this is the tree-sitter integration work we had to go through to make it work with OCaml. There are multiple steps. The first step is natural: since this parser is going to parse Semgrep patterns, we need to support the ellipsis and metavariables, so we extend the grammar for that. That's fine. Then there's a code-generation step that produces a grammar in JSON format, which is very nice to deal with. We simplify it to make it easier to process for our code generator, then feed it back to tree-sitter, which produces the final parser.c, usually 100,000 or 200,000 lines of C code. That's the parser that's going to run, linked into the OCaml code. Our own code generator then takes the grammar.json and generates several files, and some of this is a little tricky. We generate a type for the concrete syntax tree that is very nice to use from OCaml. There's a file that does some recovery, because the tree we get from tree-sitter doesn't have all the information: we have to recover exactly which branch in the grammar was taken, which is a little tricky. I'm going to show you something, but don't panic. And there's also a big boilerplate file involved in mapping our CST to the generic AST. Some of that boilerplate is generated, and the rest is done by hand.
Let's first look at the nice and clean things that work as intended. Here is a little grammar extension. The language is Kotlin, one of the hot languages we're in the process of adding support for. On the right, we have the full grammar extension, which starts from the official tree-sitter grammar for Kotlin. And this is JavaScript: there's a whole DSL within JavaScript, a domain-specific language used for specifying the grammars. It's pretty nice given that it's plain JavaScript. There's a choice function, for example. Then here's the expression rule. We say that the previous rule, 'previous', is going to be extended: we turn the expression rule into a choice between whatever we had previously and these new constructs. So $.ellipsis is our '...', and there's another construct I didn't present; we also have something for metavariables. This is how tree-sitter grammars are written. The original grammar, if we look into it, is just like this, except it's a thousand or two thousand lines of similar code. You can see that here we have this choice operation between different rules: the ellipsis rule, the deep-ellipsis rule, and 'previous', which is whatever the previous value of the expression rule was. That's a choice. This choice is an alternation, and each of the alternatives will be translated into a case in our AST, that is, into a kind of node: the ellipsis is one kind of node, the previous expression, whatever it was, is another kind of node, and the deep ellipsis is yet another kind of node. We want all of that to be well-typed in OCaml, so we have some machinery to make sure that happens, and that's a bit of what I'm showing here.
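For reference, a grammar extension in the tree-sitter DSL looks roughly like this. This is a sketch, not the real extension: the require path and rule names are assumptions, and it only runs inside the tree-sitter generator, which provides the `grammar`, `choice`, and `seq` functions:

```javascript
// Sketch of extending a base tree-sitter grammar with Semgrep's
// pattern constructs. Processed by `tree-sitter generate`, not node.
const base = require('tree-sitter-kotlin/grammar');

module.exports = grammar(base, {
  name: 'kotlin',
  rules: {
    // Keep whatever 'expression' was ('previous') and add our cases.
    expression: ($, previous) => choice(
      previous,
      $.ellipsis,
      $.deep_ellipsis,
    ),
    ellipsis: $ => '...',
    deep_ellipsis: $ => seq('<...', $.expression, '...>'),
  },
});
```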
If we run the tree-sitter parser, using the tree-sitter tools, on this simple hello program, here's what we get. This function definition gets translated into this thing, which looks very neat and concise and has nodes that make sense. The problem is that it doesn't show us everything, unfortunately, and we have to recover certain parts. On the left, I highlighted a region where the original tree-sitter output shows an expression statement, and this expression statement has a child called expression. Very nice, you might say. But if you look at the grammar, the path taken is more complicated, and we actually want that whole path. If you look here at the recovered CST that we have in OCaml, there are new nodes. The first one is the expression-statement node, one kind of node, and there's the call node that corresponds to the call expression on the left, but in between there are two sub-levels that were completely omitted in the original output. So yes, we have machinery to recover that; the takeaway is that it works, and it was quite an adventure to get it to work. Anyway, we get some good generated OCaml code, so let me jump to that; if you have questions about this stuff, we can talk about it later. The generated code looks like this on the right. We include the original grammar for reference, which is useful because it's the most readable version of the grammar; everything derived from it gets less and less readable. The other thing is that it's a Dune-ready project; Dune is the build system for OCaml. It's completely modular: you can take this Git repo, plug some of its modules into your project, and if it's already part of a Dune project, Dune will find it and build all of this very nicely. So thank you to the Dune people for the great tooling; it works really well, and that makes things easy for us and for users. Let me show you a bit of the generated code.
Okay, so this is for Ruby. We generate one Git repository for each language, because it's convenient: we like to work on languages independently. Anyway, there's a large amount of generated code. This parser.c is the main thing that tree-sitter generates; I think we can't even open it, because it's too big, in the hundreds of thousands of lines of C code. And we have our OCaml code, in particular this file, the one in charge of doing the recovery I alluded to. This is part of the machinery, all nicely pretty-printed. If you know what's going on, you can recognize combinators like alt and seq; those are essentially regular expressions. We run these regular expressions on the children to figure out the kinds of the different children we have in the original CST from tree-sitter. The regex matching lets us match the children against the anonymous grammar rules, and then we can name every piece. It's not supposed to make sense to you, but I find generated code always so satisfying to look at. And it's pretty long, so I always feel good about it, because I just don't have to write it, or worry about it, or even ever open it; it's pretty mechanical. You debug the generator once and for all, and then it just works. It's cool. Now, CST.ml is what we consult as readers and programmers, because we have to do some work on the CST to convert it to the generic AST. The CST type is generated by our tool, ocaml-tree-sitter, and all these types were generated from the grammar, so the file mirrors the structure of the grammar. Everywhere we have a choice in the grammar, an alternation between different rules, it results in an algebraic data type. This one is almost an enum. Each of the possible cases receives a name; here we got cute names based on the actual ASCII characters involved.
Sometimes the names are less nice. You try to give good names based on what you see, names that are meaningful to a human and that are also, let's say, non-conflicting and stable. There's a whole art to naming things; it's one of the two or three difficult problems in computer science. So here we go: sometimes we end up with things named less nicely, like this one here, called anon_choice_ plus some hash. That's the best we can do, because these are inlined patterns, inline things that don't really have a good name, and the name has to be regenerated consistently. Anyway, this whole thing is the concrete syntax tree definition for the Ruby language. I think the grammar is about a thousand lines, and this generated file is 1,200. So let me go back to the slides. Okay, so with tree-sitter we get a very nice CST, via some OCaml transformation. The thing is, the CST is very specific to the language we're parsing, so we still need to translate the CST to the generic AST. You might think that's a lot of work, and ask why we didn't just write the grammar by hand. Well, writing the grammar is the most time-consuming operation here. Converting the CST to the generic AST is manual labor, but it's not that hard, especially once you know what constructs the generic AST has. So we still prefer to do it this way. We start from generated boilerplate, which is a collection of mapping functions. Here's one example: map_argument_list. argument_list is a type of node, and this defines a function that maps an argument-list node into something else, and that something else should be the equivalent construct in the generic AST. What we generate is this boilerplate full of todos, a todo here, a todo there, and those have to be replaced with constructs of the generic AST.
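The mapping step can be sketched like this. All type and constructor names here are invented; the real per-language files are generated by ocaml-tree-sitter with todo placeholders and completed by hand:

```ocaml
(* CST side: invented stand-in for a generated CST fragment.
   It keeps every token, including the parentheses. *)
type cst_arg = CstArg of string
type cst_argument_list = {
  lparen : unit;
  args : cst_arg list;
  rparen : unit;
}

(* Generic-AST side, also invented for this sketch. *)
type g_expr = G_Id of string

(* The generated skeleton is roughly:
     let map_argument_list (x : cst_argument_list) = todo x
   The hand-completed version replaces the todo with real
   generic-AST constructs, dropping tokens the AST doesn't need: *)
let map_argument_list (x : cst_argument_list) : g_expr list =
  List.map (fun (CstArg name) -> G_Id name) x.args
```

So mapping `{ lparen = (); args = [CstArg "a"; CstArg "b"]; rparen = () }` yields `[G_Id "a"; G_Id "b"]`: the punctuation is discarded and only the semantic content survives.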
That's the labor-intensive part. By generating this file, we automatically get roughly half of the work done. I can show you the original generated file; it looks like this, and it's all generated boilerplate. It would be no fun to write by hand, but it's nicely structured in a somewhat obvious and repetitive fashion. Then we fill in this mapping by hand, and we have to maintain it. This is the hand-modified version: we replaced all the todos with actual constructs. For example, here there was a todo; now it's an empty list, and we have something like a list of parameters here. So we have to maintain this mapper. Again, it's pretty long, but it's not that hard. All right, that's all for the tree-sitter part. I hope I gave you a good overview of how we do things. It's not that simple, there are a lot of code-generation passes, but it's very satisfying in the end, because we feel very powerful: we can deal with all those languages at once, and they all converge to one way of doing things. Now, the other thing I mentioned earlier: what if we have a configuration file, say for Terraform, which has its own syntax for config files? Or exotic languages used here and there in some codebase? Or maybe we want to match an HTML snippet embedded in some special templating language. We do support JSX, the React extension for JS, but maybe we don't support the specific templating mechanism, whatever it is. For those, we have the spacegrep tool, which is our last resort. I thought it was too bad not to be able to do something like Semgrep when you don't have a grammar, and to have to fall back to grep, with grep being so unaware that things can span multiple lines. It was frustrating. So spacegrep tries to do things like Semgrep when possible.
It should be usable out of the box, no configuration needed. You write a pattern that looks like the actual code you want to match — no backslashes needed, which I think is very pleasant. It's just easier than grep, and kind of like Semgrep, but not completely. So let me show you what we can get out of it. That's a screenshot of what we get; let me show you in the live editor. I entered the exact pattern we had earlier, and you can see the language chosen is generic; actually, you might be able to run the same exact thing with Python. So: we don't catch the notion that there's an alias, that this safe function is the same as exec; spacegrep has no idea about that. It does have a notion of identifiers, though, so it didn't match this one, which is nice. Multiple lines are not a problem. What else? It doesn't know about comments, and it doesn't even know about string literals. But we can still do certain things. For example, if we want to match anything that's in the string, that works; that should match only this one. Yeah, this is correct. Maybe we can do something like this: we want to catch all the execs, but exclude those that use a constant string as the argument, because that's considered safe. You don't want to execute arbitrary code, but executing a hard-coded command is perfectly fine. So let me try this; this should match the dangerous execs. And yeah, we can do that here: we didn't catch exec("ls") because it's secure, but we did catch the one where we're not sure where "somewhere" comes from, and that could be dangerous. If you compare to Python: this syntax can be used on Python too, I think. Let's try it — it's a live demo. Yes, so Python works better: that's the regular Semgrep, which has a notion of comments and strings. But other than that, you can see the generic mode is limited. For simple things it's going to work, and we might get false positives, and well, that's okay.
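For reference, the patterns from the demo look roughly like this in Semgrep's pattern syntax (reconstructed from memory; `$CMD` is a metavariable and `"..."` stands for any string literal, which only works fully in the language-aware modes):

```
exec("...")      matches exec called with a string literal argument
exec($CMD)       matches exec with any single argument, bound to $CMD
```

Combining the two (match the second pattern but not the first) gives the "dangerous exec" check from the demo: any exec whose argument is not a hard-coded string.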
But yeah, so that's the thing we have. So how does it work? I wanted to show you a bit of the OCaml code involved, because it's so much simpler than the full-fledged generic AST. The view of the program is really simple. A program is made of atoms, which are one of three things: a word, a punctuation character, or a byte. And a node in our AST — AST or CST, for spacegrep it's the same — is either a list of nodes or an atom. That's all the definitions we have. On the right, we have the corresponding AST definition for a pattern, and you can see that we have a few extra constructs, for metavariables and dots. We do support some metavariable matching in spacegrep. And that's simple, nice, idiomatic OCaml. Same for matching: matching a pattern against a tree is very idiomatic OCaml. I'm just proud of this code, so I wanted to show it here. That's the nice way of writing OCaml code, like we learned in school; it's not too complicated. Once you understand how to write recursive functions to deal with trees and you get the notion of pattern matching, it should be understandable. I think that's all for spacegrep, so I'm going to conclude my talk here. This is our program analysis team: Emma, Iago, Yoann and myself are working on the core of Semgrep and making it better every day. Thank you all for your attention. If you want to install it, you have instructions here; you can find them online. We have this online editor, the playground that I used for the demos, and you can play with that. Feel free to reach out on Twitter or to me directly, and there's a survey here if you're interested. That's it, that's all for my presentation, so let me know if you have questions — or if you are inspired or terrified by this. Definitely the former; this is pretty great. So how incorporated has Semgrep become in your development process? Into ours?
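The definitions described above can be sketched like this — the constructor names are my reconstruction from the talk, not the actual spacegrep source, and the matcher is a toy version of the recursive style the talk describes, much simpler than the real one:

```ocaml
(* Sketch of the spacegrep document AST as described in the talk:
   atoms are words, punctuation, or bytes; a node is an atom or a list. *)
type atom =
  | Word of string    (* identifiers, keywords, numbers *)
  | Punct of char     (* punctuation such as '(' or ';' *)
  | Byte of char      (* anything else *)

type node =
  | Atom of atom
  | List of node list (* a parenthesized / braced / indented block *)

(* The pattern AST adds two constructs on top of the document AST. *)
type pat =
  | PAtom of atom
  | PList of pat list
  | Metavar of string (* $X : matches one word and binds it *)
  | Dots              (* ... : matches any sequence of nodes *)

(* Toy matcher: match a pattern sequence against a document sequence,
   returning metavariable bindings on success. *)
let rec match_seq env pats nodes =
  match pats, nodes with
  | [], [] -> Some env
  | Dots :: ps, ns ->
      (* try ending the "..." at each position in the document *)
      let rec try_skip ns =
        match match_seq env ps ns with
        | Some _ as r -> r
        | None -> (match ns with [] -> None | _ :: tl -> try_skip tl)
      in
      try_skip ns
  | Metavar x :: ps, Atom (Word w) :: ns ->
      (match List.assoc_opt x env with
       | Some w' when w' <> w -> None   (* inconsistent rebinding of $X *)
       | _ -> match_seq ((x, w) :: env) ps ns)
  | PAtom a :: ps, Atom b :: ns when a = b -> match_seq env ps ns
  | PList p :: ps, List n :: ns ->
      (match match_seq env p n with
       | Some env' -> match_seq env' ps ns
       | None -> None)
  | _ -> None

let match_pattern pats nodes = match_seq [] pats nodes
```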
Well, we use it on our own source code — dogfooding — though we don't do very much with it. We're doing static analysis, so even though we run it on OCaml, we don't find that much stuff. I think it's really going to be useful for real-world applications, web applications that have a lot of I/O, especially user input, tainted input; there are a lot of security issues there. So I myself, as someone working in OCaml, didn't have to fix many errors. When I touch Python code, something surfaces sometimes, and yeah, it's nice. We are not submerged by false positives, as far as I can tell, which is good. If you want a perspective on security issues, we have talks by our colleagues; I think Clint Gibler gave one that's available online. I can give you a link — I think I posted it on Twitter earlier — where he goes into more detail about real-world usage. Did I answer your question? I think you did. You use it occasionally, but if you're writing OCaml... Yeah, that's the thing; it's sort of weird. This is a functional programmers group, so I'm going to say things that should sound familiar to functional programmers. But it's kind of a niche — I mean, OCaml is not extremely widely used, and we use it especially for the kind of things where it really shines, so we are kind of in our own little world. What can I say? I lost my train of thought, and I don't want to repeat things. I did have a question. Do you see any applications to editor tooling? I noticed that there was a Semgrep plugin for VS Code; have you seen a lot of use there? It seems like it'd be really powerful. Yes, it is. We have excellent colleagues
who take care of that; I haven't really followed it myself. And there's a lot of stuff to do there. I think there's a VS Code plugin; I'm not sure about other editors. It should run in the background and, as you write code, find problems. So the integration is indeed a big effort; there's a lot of work to be done to have things up and running quickly in CI, and it should run fast enough too. Speed is more on our side: making sure things are fast enough that it isn't an issue. But there's also convenience: we want to make sure that findings are reported correctly and are easy to fix and not surprising. So we have all these rules, these rule sets — libraries of rules built over time by some of us, and also by external contributors, for catching problems. A rule is usually one Semgrep pattern, but often several patterns combined to capture more interesting things, plus a message explaining what's going on, and possibly a fix, like I showed earlier. I think we have over a thousand rules now, and people can just use these rule sets; that's the configuration here on this slide — r2c, which is one of the rule sets, I think. It's meant to be easy to use in practice without having to create your own rules. However, you can make your own rules: if you have some specific code that's a little dangerous and you don't want your users to call it, then sure, you can make your own rules. So don't hesitate to ask more if that didn't answer your question. Does anyone else have questions? Do you all know OCaml a bit? I'm not sure about the background of the audience. Yes, there are a few ML-family language users here. John, Mr. Bremmer, and myself to a small degree use F#, and then Claude is mostly focused on OCaml, and I believe there are some other OCaml users in this group as well. Yeah.
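To make the rule structure concrete, a rule file has roughly this shape. This is an invented example in the general style of the public rule registry, not an actual r2c rule, and `safe_exec` is a hypothetical wrapper:

```yaml
rules:
  - id: dangerous-exec              # invented rule id
    languages: [python]
    severity: ERROR
    message: >
      Calling exec with a non-constant command can lead to code injection.
    patterns:
      - pattern: exec($CMD)         # any exec call, argument bound to $CMD
      - pattern-not: exec("...")    # but not a hard-coded string literal
    fix: safe_exec($CMD)            # autofix, assuming a hypothetical safe wrapper
```

A rule combines one or more patterns with a message, and optionally a fix, exactly the three ingredients the talk lists.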
And if you have comments on OCaml — is there anything you find strange about what we're doing? I've been on different teams over the years, and there are different types of applications being done with OCaml. At some point I was in a web startup, so we had a whole backend written in OCaml; we were using Lwt and the whole thing, so we had an HTTP server and a lot of JSON was involved, and that's very different from what we're doing right now. Different people prefer different styles. Well, it would be strange for me to not like it, but I really like our style right now, which is: we try to keep things plain and simple. OCaml has advanced features — it has the module system, with modules and functors. A functor is a parameterized module, so we can parameterize a module with types or with values. And in recent versions of OCaml we have first-class modules that you can pass around; you have to pack and unpack them, so sometimes they are viewed as modules and sometimes as values, and there are ways to convert between those. I know some people love that; I don't know how to use it really well. So we try to keep things simple; I guess that's the style we have here. Do you find yourself using this, for example, in situations where a library version changes, and the interface to that library changes — do you find yourself doing updates to code bases using this tool? There was a syntax using replace there, so you could define one pattern and define the other pattern. Could I jump in with a concrete example? I'm thinking maybe going from Python 2 to Python 3 or something like that — is that kind of what you're thinking?
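For listeners less familiar with OCaml, here is a tiny illustration of the two features mentioned: a functor (a module parameterized by another module) and a first-class module (a module packed as an ordinary value). The names are invented for the example:

```ocaml
(* A module signature: anything that can render itself as a string. *)
module type Show = sig
  type t
  val show : t -> string
end

(* A functor: a module parameterized by any module matching Show. *)
module MakePrinter (S : Show) = struct
  let print x = print_endline (S.show x)
end

module IntShow = struct
  type t = int
  let show = string_of_int
end

(* Applying the functor produces an ordinary module. *)
module IntPrinter = MakePrinter (IntShow)

(* A first-class module: pack IntShow as a value, then unpack it. *)
let packed : (module Show with type t = int) = (module IntShow)

let () =
  IntPrinter.print 42;
  let (module S) = packed in
  print_endline (S.show 7)
```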
The module or library that supports HTTP requests had its interface changed between one version and some later version — just doing that kind of syntax transformation, within the same language of course, but modifying the calls to match the new names, for example. So that sort of thing should be possible; I don't know if in practice it's something that works really well. I can imagine that when APIs change, they don't change in a one-to-one way: it's not just a name that changes while keeping the same exact meaning. If they change the name, it's maybe because the behavior is different. I'm trying to imagine — sorry, I have to speculate. But I think it's a valid question and application. Coccinelle is the tool that Yoann was working on originally in France; it's focused on C, and it has a way of searching for code and patching it automatically. And this stuff is used — it's used on the Linux kernel — and it's good stuff, but it works only for C. I also seem to remember some other project; there was a company being built off of tree-sitter that was doing some things like that, but that's not open source. So I was just curious if Semgrep could be used in those scenarios. Yeah, we can use Semgrep for various things. I think the focus is really — well, it's a startup, things change really fast, and I don't speak for the whole company because I'm really focused on the internals. I don't want to say something stupid, but it's always customer-first: make the things that are needed, that are in heavy demand. So I would say it's security-focused, and overall it's not like dynamic analysis, where you would try to crash a system; we are more about helping developers not write bad things by accident.
But yeah, the general vibe I get is that we want to help developers first and not have tools that get in their way. We don't want the security person to impose tools on the developers that the developers will only reluctantly use. So that's what we're trying to do — I kind of digressed a bit, but that's what we do anyway. Our meetings usually end in massive digressions into adjacent subjects, so welcome to the functional programming users group. Yeah, it's a startup, things move fast, and that's nice. I can talk about other things; I don't know if you have questions about the company, or if we're done — I don't have the full view of the audience. What do you think, David, should we stop here, or is there more we want to discuss?