OK, thanks. Let's start. So hi, folks. I'm Serge, and we're going to spend 30 minutes or so speaking about Python and static analysis. Well, to be honest, the title is a bit misleading, because it's impossible to use full Python and at the same time have very accurate static analysis. So the real title should be "Python or static analysis, pick one". But just because we can't do it for the whole language doesn't mean we can't do relevant stuff, and that's what we are going to explore together. Whoops, next slide. So, I'm a compiler engineer, but I have a lot of other hobbies, and I'm going to present some tooling for doing partial static analysis on Python programs, based on my previous experience with that topic. That experience mostly comes from the Pythran project. Some of you may know it: it's a DSL embedded into Python, targeted at scientific computing, which makes it possible to transparently compile Python kernels into efficient native libraries that you can still import from Python, getting good speedups when you're using NumPy and that kind of stuff. So it's not a generic compiler at all, it's very specific, but it's still a good playground. While developing it I obviously met a lot of issues, and some of them were relevant enough to be split out into new packages solving specific problems that I thought would be useful to other people. And that proved to be true. The two major components of Pythran that are reusable are gast and beniget. These two packages take their names from Breton, a regional language spoken in the west of France, and I'll give the translations as we go through the tools. The first one gives you an abstraction of the Python abstract syntax tree across Python versions, and the second one provides some basic analyses of Python programs that other compilers or analyzers can build on.
And I'm going to go through the two of them, and then describe another project, memestra, which uses these two building blocks to write a simple but efficient static analyzer. Let's move on. So, gast. It's available on the cheese shop (PyPI), on GitHub and so on. The name could mean "generic AST", which is totally relevant, but in Breton it also means, let's say, a good-time girl. This tool started four years ago and was presented at PyCon France. The original goal was to ease the transition from Python 2.7, which was the original target of the Pythran compiler, to Python 3. When you're migrating a compiler across different versions of Python, it's not as easy as for a regular application, because you're changing the input of your program: you're going to parse Python 2 programs, or you're going to parse Python 3 programs. That's very different from just adding parentheses around print statements, right? So instead of reworking the compiler for each new Python version, I wrote this small abstraction layer to ease the manipulation of the AST. Basically, it gives you a generic AST that you can manipulate whatever the underlying Python version you're using. It turned out to be very successful: there are more than 100,000 downloads per day according to the PyPI stats for that package. But you shouldn't trust the numbers: it's actually a dependency of TensorFlow, not my fault. So only one package is really using it a lot in addition to mine, but that's an important package. So it looks like a very popular package, while it's not as popular as it seems. To give you a hint of what it tries to do: if you run this simple Python line, you import the ast module and then dump the result of parsing a simple subscript expression, which is valid Python from 2.7 up to 3.9. And we're going to see how the Python compiler sees that expression.
And it changes a lot depending on the actual Python version. If we are on Python 2.7, at the top level we've got a Module, because even a single expression has to live in a module. This module has a single statement, which is an Expr, and this expression is the actual Subscript. The important part is that the slice field of that subscript is an ExtSlice, because it has several elements: the first dimension is an Index, and the second one is an Ellipsis. That's how it's represented in Python 2.7, and you can build your compiler based on that. But if you happened, and I hope you did, to move to Python 3, say Python 3.6, then you still have the Module, the Expr and the Subscript, but the slice is no longer an ExtSlice: the node is now an Index wrapping a Tuple. So if you want your compiler to work on both Python 2.7 and Python 3.6, you need to handle two different representations of a subscript. Even worse, if you go up to Python 3.9 because you live on the bleeding edge, then you no longer have the Index at all: the slice is a plain Tuple. It means that whenever a new Python version comes out, be it a major or a minor version change, you need to update your whole compiler, because you're no longer matching the same abstract syntax tree. So what I did is put all that glue code in the gast module. Whatever the Python version you're using, if you dump the AST using gast instead of ast, you get exactly the same API, just to ease the transition, and you get the same tree. The gast representation is very close to the 3.9 one, because we tried to match that one, but it's also compatible with all the previous versions. So if you run that exact code on Python 2, you get the exact same abstract syntax tree, which means you can think in terms of the gast representation instead of the official ast representation, and your compiler is fine with that. That's a super nice property.
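To see the version dependence concretely, here is a minimal snippet using only the stdlib ast module; the node types in the comments follow the versions discussed above.

```python
import ast

# Parse the subscript expression from the slides with the stdlib ast
# module.  The shape of the tree depends on the interpreter: on 2.7 the
# slice was an ExtSlice, on 3.6-3.8 an Index wrapping a Tuple, and on
# 3.9+ a plain Tuple.
tree = ast.parse("a[1, ...]")
subscript = tree.body[0].value         # Module -> Expr -> Subscript
assert isinstance(subscript, ast.Subscript)
print(type(subscript.slice).__name__)  # "Tuple" on Python 3.9+
```

Running the same two lines with `gast.parse` instead of `ast.parse` gives the 3.9-style tree on every interpreter, which is exactly the point of the package.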
And it's not obsoleted by the doom of Python 2: it's also very relevant across all the minor Python 3 versions. So that's gast, a very simple building block. Of course there are trade-offs. The first one is that, because of legacy compatibility with Python 2, which we're still keeping, our AST representation is slightly more verbose, especially for Name nodes. That shouldn't be a big trouble, but it's good to know. Also, we rely on the official ast parser; we're not reimplementing it from scratch. So there is a translation from the Python ast representation to the gast one. It's an extra processing step, which shouldn't matter even for large modules, but it does matter if you're trying to parse every Python file on your system, which is something I tried to do. In that case you get something like a factor-of-two slowdown, because you're creating a bunch of new objects; you have to pay for it. But then, you're using Python, so you don't care that much about paying a bit to do things in an easier way. So that's gast, very useful, at least to me, and it proved to be useful to other people too. Then beniget. Same thing: available on GitHub and on the cheese shop, and used by a few other packages in addition to Pythran. This time it's not a bad word: it means "blessed". It's what you'd say when you're surprised, or when you want to protect yourself, something like that. To understand this package you need some compiler background, but a very basic one, and I'm going to introduce it in the next slides. It computes the use-definition chains of a Python program, statically. It's a foundation for a lot of more advanced analyses, especially the Pythran ones. But to understand how cool it is, you need to understand use-def chains. And when you need to understand something, you go to Wikipedia.
So Wikipedia tells you that a use-def chain is basically a link, or a tree, between the definition of a variable and its usage at another point in the program. For instance, if I define a variable a and then use it, there is a chain between that definition and that usage. Because of the dynamicity of Python, there are a lot of ways to create chains between identifiers, and some of them are a bit tricky. But first, what is it useful for? If you've got a definition that has no usage, it means this definition is not useful at all. So you can use beniget to write some kind of linter that detects that you're creating a binding somewhere and that this binding is never used anywhere: an unused import, a useless assignment, that kind of thing. You have to take a few things into account, though. For instance, the underscore is conventionally used to mark an assignment as intentionally unused, for instance when destructuring. And when you import a module, even if that module is never used, there can be side effects: the module may not exist, and you may have imported it precisely to catch the ImportError; or the import may have side effects of its own, because code actually runs at import time. So it may not be correct to remove that import. That's up to the user, and technically not something a static analyzer can decide for you. Also, just a reminder: you're not really defining a variable when you write an assignment in Python. You're creating a binding between a value and an identifier, which is the name of the variable. It's somewhat pedantic, but once you've got that model in mind, it helps a lot to understand the limitations of use-def chains for Python. For instance, here is a small loop: I'm iterating over l, and in some cases I'm printing g, and in other cases I'm assigning it.
So, is the print statement faulty? The answer is the same as for all the tricky cases: it depends. It depends on the runtime content of l. Depending on whether l is empty or not, and whether l has values that evaluate to false or not, the print statement may be faulty. But according to the use-def chains, there is a possible valid def-use pair: the else clause defines g and the true branch uses it. So it looks like perfectly valid code, and it may not be. That's dynamic behavior that beniget is not able to capture. Well, it could emit a warning, actually, but we're not doing it that way; it's static analysis. Another case: you define a global using the cursed keyword global, and you do the assignment inside a function. In another function, you use x. Is that faulty or not? According to the static analysis of the use-def chains, there is a def in foo and a use in bar, and that's fine, or at least we can represent it. But it's not capturing all the possible usages: it depends on whether bar is called after or before foo, and we don't know that, because we don't know how the module is used. We only do a per-module analysis, not a whole-package or whole-world analysis, which anyway would depend on the PYTHONPATH, which we don't control. So it's not possible to decide whether this is the only possible chain or not. Still dynamic behavior. Another case, which looks super simple: we've got an assignment to x, then we iterate over a sequence, possibly rebinding x to the different values in that sequence, and then we print x. The print is a use, but to which def does it refer? It may refer to the first assignment or to the loop assignment, and we don't know. So we represent both: both bindings are potential definitions for the print usage. With that in mind, what does beniget do?
It computes, for any usage in your module, all the possible definitions, which may not be the actual one. That's super useful, but it's only a superset of the actual definition-use links that may happen at runtime. That's the difference between dynamically creating bindings and statically assigning variables. Although it seems I'm super pessimistic, it turns out that if you're writing a compiler for a subset of Python with static assumptions, which a lot of people are doing, then it is useful. It's useful to write, for instance, a simple linter. You may know pylint or pyflakes, that kind of linter. With beniget, you can write a much simpler linter in only a few lines. You basically iterate over all the def-use chains, and whenever you've got a def that has no use, you check if the def's name is the conventional bypass, the underscore, even if someone in the chat told me it may not be as conventional as I would expect. And if it's not a bypass, plus a few other checks that I removed here for the sake of fitting on one slide, you issue a warning. That's a good first step towards a linter. And the good thing is that this linter does not depend on the Python version, because beniget itself depends on gast, and we already saw that gast is agnostic to the Python version. So with that, we've got something really generic, with the concerns split across several packages, which is a nice property from an engineering point of view. And to illustrate the limits: you obviously understood that beniget can't do anything if you've got a call to globals(), or to eval(), or to locals(), or to __import__, all that kind of stuff. We can't do anything about that. Well, actually, what we could do, and what we do, is that whenever you reference globals(), you reference the world.
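The unused-binding check just described can be sketched with the stdlib ast module alone, on a flat snippet with a single scope. This is only a toy illustration of the idea, not beniget's API; beniget additionally handles scopes, control flow, and the gast tree.

```python
import ast
from collections import defaultdict

code = """
a = 1
b = a + 2
unused = 3
print(b)
"""

defs = defaultdict(list)   # identifier -> lines where it is bound
uses = defaultdict(list)   # identifier -> lines where it is read
for node in ast.walk(ast.parse(code)):
    if isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            defs[node.id].append(node.lineno)
        else:
            uses[node.id].append(node.lineno)

# A def with no use is a candidate warning, unless it is the
# conventional "_" bypass.
for name, lines in defs.items():
    if name != "_" and name not in uses:
        print(f"W: '{name}' bound on line {lines[0]} but never used")
```

On this snippet, only the binding of `unused` is reported; `a` and `b` both have uses.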
So it creates links from all the possible globals available in the module: you're referencing all of them. But that's still a subset of the possible globals, so it's not enough; we're just doing our best. Again, be aware: Python or static analysis, not Python and static analysis. That's the key. And as an illustration of how to write a useful tool based on this generic tool set without too much effort, let me introduce memestra. So, memestra means two things. People from Brittany are known to drink a lot, so it could be what you say when you're ordering a beer or a cider. But it's also what you say when you're annoyed by children or by other people: you'd say "memestra", roughly "oh please, stop". So: stop writing deprecated code. That's what memestra tries to do. It checks your code for places where a deprecated function is used. And what is a deprecated function? A function decorated with a deprecated decorator, which is not standard: you can have your own in your package, or you can use the one from PyPI, which is nice. Based on the tooling I introduced, you can probably guess that it's easy to write that kind of useful linter on top of beniget. Basically, it's three steps. The user gives you the signature of the decorator, say decorator.deprecated or something like that, and you track the usage of that decorator in the code; tracking usage from a definition, that's beniget. Once you have all those usages, you have all the functions decorated by this deprecating decorator. Then you track all the usages of these deprecated functions. If there are none, that's cool. If there is one, is that usage inside a function that is itself deprecated? Then that's okay, because you're still inside the deprecated world. But if it's not, then you're using a deprecated function in a non-deprecated function.
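The three steps can be sketched as follows, again with the stdlib ast module only. This is a toy single-module version of the idea, not memestra's actual implementation, which resolves the decorator and the calls through beniget's def-use chains.

```python
import ast

code = """
import decorator

@decorator.deprecated
def foo():
    pass

def bar():
    foo()

foo()
"""
tree = ast.parse(code)

# Step 1: collect the names of functions carrying the decorator.
deprecated = {
    node.name
    for node in ast.walk(tree)
    if isinstance(node, ast.FunctionDef)
    and any(isinstance(d, ast.Attribute) and d.attr == "deprecated"
            for d in node.decorator_list)
}

# Steps 2 and 3: flag calls to deprecated functions that are not made
# from inside a function that is itself deprecated.
warnings = []

class Checker(ast.NodeVisitor):
    def __init__(self):
        self.stack = []          # names of enclosing functions
    def visit_FunctionDef(self, node):
        self.stack.append(node.name)
        self.generic_visit(node)
        self.stack.pop()
    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Name) and func.id in deprecated:
            if not any(name in deprecated for name in self.stack):
                warnings.append((func.id, node.lineno))
        self.generic_visit(node)

Checker().visit(tree)
print(warnings)   # one warning inside bar, one at the top level
```

This reproduces the two warnings of the test.py example on the next slide: one for the call inside bar, one for the top-level call.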
And that's something the user needs to be warned about, so you print it. And that's it; that's how you write a deprecated-function linter using beniget. As with all simplified talks, it's slightly more difficult than this, but the idea is the same. So for instance, in that test.py code, you import the decorator package and you deprecate foo. foo is used both at the top level and inside the function bar, and bar is not marked as deprecated itself. So if you run memestra on this module, you get two warnings, and that's what we would expect. But if you think a bit, you may think about NumPy. NumPy's API is quite large, and it marks a lot of functions as deprecated, because the API evolves over time. So in your client code, you want to parse the NumPy code to check whether anything you use is deprecated. It means that instead of a per-module analysis, you want a cross-module analysis. That's a bit more difficult than what I've described, because you have to assume that the Python path is known; you can then start to resolve the imports statically, and then apply the very same method as the one I described on the previous slide. To do that efficiently, you need a cache which correctly represents how a module imports in Python. And with that, you can process the whole scientific Python stack, including NumPy, in a matter of minutes, using memestra to look for deprecated usage. So it's still possible, and it can be done efficiently. Speaking of advertising deprecated usage, I should mention the deprecated package, which is not official but just does the job and is used in a lot of places. You can just use it to decorate your deprecated functions, and you can even give a version or a reason why you're deprecating them. So feel free to use that; memestra is obviously compatible with it. Actually, memestra is independent of the decorator: you can configure that in a decent way.
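For illustration, here is a minimal hand-rolled stand-in for such a decorator, built on the stdlib warnings module. The real deprecated package from PyPI offers a similar interface with version and reason arguments; the names and signature below are mine, not the package's exact API.

```python
import functools
import warnings

def deprecated(reason="", version=""):
    """Mark a function as deprecated; warn whenever it is called."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            msg = f"{func.__name__} is deprecated"
            if version:
                msg += f" since version {version}"
            if reason:
                msg += f": {reason}"
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return inner
    return wrap

@deprecated(reason="use new_api instead", version="2.0")
def old_api():
    return 42

# Calling the deprecated function still works, but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()

print(result)                  # 42
print(str(caught[0].message))  # old_api is deprecated since version 2.0: use new_api instead
```

Note that this is a runtime mechanism: the warning fires only when the function is actually called, which is exactly why a static checker like memestra is complementary.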
So, as usual, there are limitations, and in this case typing is in our way. Here, the foo method is marked as deprecated, but only within the class Foo. If we have another class with a foo method that is not deprecated, then we can't resolve, in bar, the call to foo and say, okay, foo is that deprecated method; we are doomed. memestra does support flagging a class as deprecated, and in that case, whenever you instantiate that class, you'll get the warning; but for methods, it's not the case. We could imagine a coupling with mypy, or using type annotations to improve the analysis, but that's far beyond this talk; it may be a relevant goal for memestra, though. So, we're getting close to the end. Again: Python is not made for static analysis. Have a look at mypy and how they're trying to do static typing for Python: there are limitations, and for any static tool there are limitations, and you need to be aware of them. Once you're aware of them, that's okay: you can start building tools and sharing them, as I do and as other people are doing. And Python is still a perfect host for embedded DSLs, and a lot of people in the Python numeric community are doing that. To finish the talk, I'd like to thank Sylvain Corlay for the original idea of memestra, Mariana for reviewing a lot of my pull requests on memestra, and Lancelot Six for proofreading this talk. And I'm handing over to you, Nicolas, or to anyone who has questions. Okay, so we have two questions; I'm going to take the first one. David is asking: can you use the Python ast module to parse code for other languages, such as C? If not, maybe you know some specialized libraries for creating ASTs for other languages? So no, you can't. The Python ast module is specific to Python, and as I said, it's even specific to the actual Python version you're running. You can't parse any other language, unless it is syntactically strictly embedded into Python.
So if you've got a subset of Python, you can use the Python ast module to parse it, and that would work, because that's only parsing, not trying to interpret anything. But it happens that I'm also an LLVM developer, so I can answer your second question: if you want to parse C or C++, Clang has Python bindings for that. You can use the Python bindings of the Clang compiler and get a representation of C or C++ that you can manipulate from Python. Cool, so, last question. Artem is asking if it's applicable only to functions; what about class-level or global variables? I didn't get the first part of the sentence. Is it applicable only to functions? So if the question is about memestra, then yes, it's applicable to both functions and classes. But it's not applicable to functions inside classes, so not to methods. Okay, cool. So, we're on time. Thank you very much. Thank you for presenting. Have a good day. Thanks, Nicolas, and thank you all.