Thank you very much. It's really my great honor to be here to speak at the conference. The topic I'm going to present today is a static analysis framework. Before I start, can someone tell me how much you know about static analysis? If you know something about it, please put your hand up. So many. Okay, thank you.

I would also like to mention that there are actually two speakers for this project, but my co-speaker couldn't make the trip because of a family issue, so I will give this talk alone.

First of all, something about our team. I'm a final-year PhD student at Monash University and a member of the SMAT Lab in the Department of Software Systems and Cybersecurity. SMAT stands for smart software analysis and trustworthy computing. We work on research in software engineering, and program analysis in particular. My own research interest is static program analysis for Python code. Monash University is located in Melbourne, Australia. It is a member of the Group of Eight universities and is normally considered one of the best universities in Australia.

Now let's talk about software engineering research, because I know quite a few of you are from industry. Software engineering research is a very broad concept. It tries to analyze different aspects of software applications, their production, software development and developers. One purpose is to identify code defects. For instance, in a very large code base, we hope to identify where the bugs are located, whether there are memory leaks, and whether there are security vulnerabilities.
These are really very costly for industry if they are found manually. The second topic in software engineering research is building software development techniques such as IDE support, for instance code completion or API recommendation, so that we provide more tools and techniques for developers to write code more efficiently. In the broader context of software engineering, we also have empirical studies, so-called empirical software engineering. It studies evolution patterns and the human aspects of software production, because a major component of software production is human, so human issues are very important in software engineering research.

In today's talk I'm going to give an introduction to our work on static program analysis. Although many of you already know about static analysis, I still want to introduce it. We study the behavior of computer programs, and by scanning source code only we can discover code defects during the development phase, so that people can fix them automatically or manually. Static analysis is normally required to scale to a very large amount of source code, and our project is a purely, 100% static approach.

The reason we use a static approach is simply that in many scenarios we cannot actually execute the program, or we cannot generate all the possible inputs for it. Let's say you have a project with maybe a million lines of code. How can you use human labor to identify the bugs, and how can you actually execute the program? A program can take maybe a week or two to finish execution. And if you want to test a software system by feeding in all the possible inputs, that's unrealistic.
So this is static program analysis. Here's an example: we have a variable named case, and it takes its value from input. If case is one, we initialize a variable named a. If case is two, we initialize a variable named b. Finally we add one to the variable a. If you remember the conditional statement here, and case is two, do we have a problem? Right, apparently we do, because a is not defined. So this is a NameError. NameError is the example I'm going to use for this talk.

Now here's the central topic of today's presentation. Let's look at how the Python language has been used over the past 20 years. We simply look at the Programming Community Index for those 20 years. After 2020, Python overtook Java and C as the most popular programming language. As you can see from the diagram, right after 2016 it clearly grows much faster. So the question now is: with Python being the most popular programming language, how do we analyze Python programs and how do we build better tools? Because it grows so fast, there is a lag between the analysis work and software development.

Let's look first at the industry solutions. We can summarize that all the major tech companies offer static analysis tools, for instance Pytype, Pyright and Pyre, from Google, Microsoft and Facebook (Meta). These solutions are basically about one specific problem: the three tools I've listed here are about type checking and type inference. They are not applicable to broader challenges such as dependency analysis. I guess many Python developers are really frustrated by dependency problems: you think you have everything done, but when you download the software and try to execute it, it gives you a ModuleNotFoundError, right?
So this is the problem with the industry solutions: each approach is basically about one specific problem. We also want to look at what researchers have been saying over the past three years. If we look at the major, prestigious software engineering conferences, people actually make quite a few complaints about Python static analysis. At ICSE 2019, which is normally considered the most influential software engineering conference, researchers complained that there are limits to type inference, so that they cannot infer the dependencies for code snippets. Inferring them is actually very useful, because in many cases we get software from an open-source project and want to reuse it. If we cannot get the dependencies for code we have already obtained from an open-source project, that code is meaningless for us.

Right after that there is research from myself. In our study we show that there is a real need to help Python developers avoid using deprecated APIs, because if you use deprecated APIs, then maybe one or two years later, or even half a year later, those APIs are removed from the latest packages, and this causes a lot of trouble for your maintenance work.

After that we saw a very interesting and very promising work on call graph generation, which is about understanding how function calls are represented in Python source code. But the authors also noted that in their work they have to ignore conditionals and loops, and these are very commonly used Python syntax features. Lastly, this year we saw a very strong comment that Python cannot take advantage of analyses and algorithms that have been developed throughout decades of research. Personally,
I think this is a strong comment.

Now, if we compare Python with the other two major programming languages, Java and C: these two are normally used for production software, especially at very big companies like Oracle and Google, which rely heavily on them. If we compare the analysis algorithms available for Java and C, we can clearly see that a three-address code IR (intermediate representation), a data structure extracted from source code that we can build upon for other tasks, is not available for Python. Even for the call graph, which as I mentioned is a very important concept in software analysis, construction was an open problem until 2021. And lastly, so far we haven't seen a static analysis framework for Python comparable to Soot for Java, which is a very famous project. So the toolchains are mature and sophisticated for Java and C.

But why is this so hard for Python? We have to look at why, and it is because the language is so different. First of all, it has nested scopes. Nested scopes means you can define a function inside a function inside a function, and so on. When I scanned open-source projects, I even found a project with scopes nested seven levels deep. That means they have a function inside a function inside a function, seven times. I really wonder how a developer can manage that. How can they remember the structure? The second one is higher-order functions.
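As a rough illustration of these two features, here is a minimal sketch. It is not taken from the speaker's slides: the names `outer`, `middle`, `inner`, `circle_area`, `square_area`, `pick_area_function` and `area_of` are all hypothetical. It shows a name resolved through nested function scopes, and a call whose target depends on a run-time value.

```python
import math


def outer():
    label = "defined in outer"

    def middle():
        def inner():
            # 'label' is not local to inner or middle; Python finds it
            # in the enclosing scope of outer.
            return label

        return inner()

    return middle()


def circle_area(radius):
    return math.pi * radius * radius


def square_area(side):
    return side * side


def pick_area_function(shape):
    # A function is an ordinary value: it can be returned and rebound.
    if shape == "circle":
        return circle_area
    return square_area


def area_of(shape, size):
    compute = pick_area_function(shape)
    # Which function 'compute' names here depends on the run-time value
    # of 'shape', so the call target is not visible from this line alone.
    return compute(size)
```

A call such as `area_of("square", 3)` ends up in `square_area`, but that can only be seen by following the value of `shape` through `pick_area_function`, which is the kind of reasoning a call graph builder has to approximate.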
This is not a feature of the other two major languages. A higher-order function means you can use a function as a variable. For instance, you can define a function and bind it to another name, so that you can use the function through that name. This is very practical and very convenient for scientific computing. I worked on scientific computing myself, and it is very convenient. But what makes it so convenient for fast prototyping also makes it so difficult for analysis. Let's look at this code snippet. You define two area functions to compute the area of different shapes. But when a function call actually happens, how can we know which function is actually invoked?

The third feature is dynamically typed variables. Strictly speaking, a variable has no type. Only the value has a type in Python, and variables are actually just names. Dynamic typing means the developer has to write many different cases to verify whether the input belongs to a certain type. Python has so-called duck typing, right? So the type of a variable can change. This feature is also not part of C and Java, which are statically typed languages.

And lastly, evolving syntax. Yesterday, right here, I listened to a very interesting question for the Python core developers: is the evolving syntax of Python going in the right direction for the Python ecosystem? I think it's a very good question for the core developers, and the core team said, we believe so, we believe it is in the right direction. But I wouldn't agree in terms of static analysis, because fragmentation issues are really a nightmare for us. With evolving syntax, Python has releases every year, so it's more like a smartphone app than a programming language, right? So what makes conventional analysis
inapplicable is that Python is so unique and different from the other two languages. Here's a code snippet where an area function is defined twice, for different shapes. When the function call happens, how can we identify which function is invoked? If you are interested, you can think about this one.

What also made it harder to analyze was the situation when I started my PhD in 2020. I was just in my first year, and I realized that when I developed approaches for solving challenges like dependency analysis, I had to build everything from scratch. So why not make it an open-source project? That was my idea: to offer as many functionalities as possible for developers. At the core level we hope to offer the critical algorithms for static analysis. On top of that we can build modules like API name qualification, which are very practical for solving particular problems such as API studies, where we want to know which APIs a Python program has accessed. With this you can solve dependency issues: if you know which APIs a Python program has accessed, it is easier to know which dependent libraries you need to detect. So this is the overview of our framework.

The objective, as I said, is to offer a set of functionalities such as control flow graphs, def-use relations, scope analysis and code rewriting to facilitate program analysis for Python. Next I will use NameError as an example and describe why the four modules I'm listing today are important.

Now take the previous example again. We have a function named toy: if case is one, a is zero; if case is two, b is zero; c is equal to a plus one after the if statement; and we return a. So now we know the problem occurs in the assignment to variable c.
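The toy function as described can be written out as below, together with a small checker for names that may be read before they are bound. This is a minimal sketch and not the framework's algorithm: it uses only the standard `ast` module, understands only straight-line code and `if`/`elif`/`else`, and the helper names (`names_in`, `report_reads`, `check_block`, `possibly_undefined`) are made up for illustration.

```python
import ast
import builtins

# The toy function from the talk, kept as source text so it can be parsed.
TOY_SOURCE = '''
def toy(case):
    if case == 1:
        a = 0
    elif case == 2:
        b = 0
    c = a + 1
    return a
'''

BUILTIN_NAMES = set(dir(builtins))


def names_in(node, ctx_type):
    """Collect ast.Name nodes under 'node' with the given context type."""
    return [n for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ctx_type)]


def report_reads(node, defined, warnings):
    """Record every name read under 'node' that is not definitely bound."""
    for name in names_in(node, ast.Load):
        if name.id not in defined and name.id not in BUILTIN_NAMES:
            warnings.append((name.id, name.lineno))


def check_block(statements, defined, warnings):
    """Return the names definitely bound after 'statements' have run."""
    defined = set(defined)
    for stmt in statements:
        if isinstance(stmt, ast.If):
            report_reads(stmt.test, defined, warnings)
            after_then = check_block(stmt.body, defined, warnings)
            after_else = check_block(stmt.orelse, defined, warnings)
            # A name is safe after the 'if' only when both branches bind it.
            defined = after_then & after_else
        else:
            # Simplification: reads are checked before the statement's writes.
            report_reads(stmt, defined, warnings)
            defined |= {n.id for n in names_in(stmt, ast.Store)}
    return defined


def possibly_undefined(source):
    """Check each top-level function for reads of possibly unbound names."""
    warnings = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            params = {a.arg for a in node.args.args}
            check_block(node.body, params, warnings)
    return warnings


findings = possibly_undefined(TOY_SOURCE)
print(findings)  # [('a', 7), ('a', 8)]
```

The line numbers count from the start of `TOY_SOURCE`, whose first line is empty. At run time, calling `toy(2)` raises `UnboundLocalError`, which is a subclass of `NameError`, because `a` is a local name that was never bound on that path.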
Right. So a common story goes like this. You develop an application, and you feel that two packages are really useful for your project. I think Python developers always love these dependent libraries. Then you find that both of them depend on the same library, but unfortunately with different version constraints. Even after you have resolved the dependency, the story ends like this: finally you see the NameError problem. So NameError itself is not such a toy problem. Why? Because we observe that it frequently occurs in very popular Python libraries such as tqdm, as I show here. tqdm is very popular among deep learning developers; it shows a progress bar for your program. The library has more than 100 contributors and is used by more than 270,000 projects. That means the bug actually propagates from the open-source project to your own program. So we can summarize that the problem is scattered among Python open-source projects.

I guess some of you, if you work in data science, often use computational notebooks like JupyterLab, right? Are there people here using Jupyter notebooks? Okay, thank you. About three years ago there was a study pointing out that, among one million computational notebooks, NameError is actually a major problem. This was also confirmed by another study from computer science education research: NameError is also a primary problem for Python beginners, especially students. It sounds so simple: the variable is not defined. This is actually not a problem in C and Java, because the compiler checks it. But Python code just executes, and even when the error happens,
you don't know where the problem is triggered.

Now, if we look at the code snippet in a graphical way, we build a control flow graph. The control flow graph represents all the possible execution paths of the given program. We are able to see that there is a certain path along which, if the program's control flow follows it, the problem will occur. Right here the path goes through the if statement: if case is two, or more generally not one, we see that the NameError occurs. So building the graph is essential for detecting such problems.

But building the control flow graph alone is not sufficient, because in real-world scenarios a control flow graph will have more than 1,000 nodes, or even many more, so inspecting paths one by one is not feasible for us. So we need the def-use relations, which are also offered by Scalpel. They try to answer the question: what does a variable depend on? We have a set of techniques such as static single assignment and constant propagation, which I'm not going to discuss much today. But we can say that the relationship between all the variables and their definitions is critical for detecting software bugs, especially NameError, because apparently if you cannot find the definition for a certain variable, that's a NameError, right?

Next, we often don't just want to know where the problem is; we want to know which line of code actually triggers it. I've already shown you that the problem arises from the if statement, but in a very large-scale project we also want to know where the function call is, right? The reason is that this can be essential for IDE support. Now, to understand this information,
we have to go to scope analysis. Python has the so-called LEGB rule: local, enclosing, global, built-in. As I said, Python has a very special scope design, the nested structure. So we borrow a concept from recent research in the programming language theory community and build something called a scope graph. Based on the scope graph, it works like this. First we look at toy, starting in the local scope, and we find that the name is not declared in this scope. Now we go to the enclosing one, and it's not declared there. So we go to the global one, and finally we find it. But things do not always work out in the way I'm describing.

In the internal representation of the Scalpel framework we have the so-called scope graph, where lexical scopes are represented in a graphical way so that you can see the different relations. Let's say case is declared in the scope of a function, toy is referenced in the scope of a function, the scope of a class is the parent of the scope of the function, and the module is the root scope here. So finally we can map the name toy to its definition.

Lastly, we don't only want to detect where the problems are; we also want program repair, right? We don't always want to fix bugs by ourselves, so automation will always be the best way. The Scalpel framework also offers a set of APIs so that you can rewrite your programs, not just for automatic program repair but also for program transformation, which is another interesting application. Transformations in Scalpel are made directly on the abstract syntax trees. The purpose is to fix some errors automatically.

Lastly, I would like to show you a summary of our project. It already has more than 11,000 lines of code, and these are some statistics.
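The LEGB lookup over a scope graph described above can be sketched as a chain of scopes linked by parent edges. This is only an illustration under simplifying assumptions, not the framework's internal representation: the `Scope` class and the scope names are hypothetical, and the sketch ignores class scopes (which Python skips when resolving names from nested functions) as well as `global` and `nonlocal` declarations.

```python
import builtins
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """One lexical scope with an edge to its parent scope."""
    name: str
    parent: Optional["Scope"] = None
    declared: set = field(default_factory=set)

    def resolve(self, identifier):
        """Return the name of the scope that declares 'identifier'.

        The walk goes local -> enclosing -> global by following parent
        edges, then falls back to built-ins. None means no definition
        was found, which is a NameError candidate.
        """
        scope = self
        while scope is not None:
            if identifier in scope.declared:
                return scope.name
            scope = scope.parent
        if hasattr(builtins, identifier):
            return "builtins"
        return None


# A hand-built scope graph: module -> outer -> inner.
module = Scope("module", declared={"toy", "outer"})
outer = Scope("outer", parent=module, declared={"inner"})
inner = Scope("inner", parent=outer, declared={"case"})

print(inner.resolve("case"))     # inner
print(inner.resolve("toy"))      # module
print(inner.resolve("len"))      # builtins
print(inner.resolve("missing"))  # None
```

Keeping the parent edge explicit makes the lookup a plain graph walk, so the same structure can be extended with other edge kinds if more of Python's scoping rules need to be modeled.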
I'm very surprised: half a year after making it publicly available to the community, just half a year, we have received more than 100 stars. For data science or other very popular topics this would be very easy to achieve, but for static analysis we don't often see so many people interested in such projects, so I'm really surprised, and I really appreciate the recognition from the community.

Lastly, this is also why I came here from Australia: we hope to hear your opinions on our project. Do you have ideas for this kind of static analysis work? If you are from a technical background, do you have an analysis algorithm that you feel would be a good fit for our project? I'll be really glad to talk with you. Or maybe you hope to contribute to this project, even with a code review. Because we are from academia, we are not really good at managing an open-source project, so any voice or feedback from you will be really appreciated by our team. Lastly, I'd like to thank all of you. Thank you for your time. I really hope you enjoyed the talk. I'd like to take your questions next.

Okay, guys, let's give a warm round of applause for Joey. Thank you very much for that presentation. I think it's really interesting to see that tension between the work we're seeing in the CPython runtime, with all of its new features and expansions, and the desire to build static analysis and improve code in that way. I see we have one question from a member of the audience.

Yeah, I have a couple of questions. My first question: you said there are like four Python static analysis tools. Which tools did you count?

Okay, for this part, it's actually from Wikipedia, because it is really difficult to give the total number no matter where you look, so I simply used the data from Wikipedia.
I think that's a standard.

Yeah, because I know there's Bandit, which is open source. There is CodeQL, which is from GitHub, and closed-source ones like Sonar or Checkmarx, which are proprietary as well.

Yeah, but in that case, if we count all the products from the community, we have to count them for the other two major languages as well, so as to be fair.

That makes sense. Do you actually have some rules implemented that could be used in a CI/CD pipeline, or maybe for security testing or things like that?

At the moment we don't have security rules.

And regarding scope analysis, do you also support keywords like global or nonlocal? Do you consider these?

Yes, those statements we consider.

Okay. That's all the questions I have. Thank you.

Thank you very much. Are there any other questions at this point? If so, please approach the microphone here.

Hello. Thank you for your presentation. I'd like to ask whether you did some performance evaluation of your algorithm, for the call graph construction for example. I mean the time performance, or other performance.

Yes. In terms of execution efficiency, for NameError detection itself, because NameError detection is right now a part of my research, we haven't released that data to the open-source project yet, for academic reasons, but we are going to. I can say that on average, for the open-source projects we scanned, it's something like 0.17, 0.70 seconds per source file. So I think it's reasonable.

Okay. Yes. Thank you.

Yeah, I think this is a really important question, because static analysis must be scalable; otherwise it doesn't make any sense.

Thank you. Are there any other questions at this point for Joey?
If not, then we'll round off this session with another round of applause and a thank you for Joey.