Okay, good morning. Can everyone hear me at the back? Yep, great. I'll just introduce myself and the company. My name is Mark Shannon. I have been playing around with Python for over a decade, and most of that I've spent either working out how to optimise it, or how to analyse it, or how to break it as a side effect of those two things. [A passage here is unintelligible in the recording.] In February 2012, the Curiosity spacecraft was in flight to Mars, and NASA found a bug in the lander software. Now, the lander software is written in C, and in C you can pass arrays around; but if you pass arrays around, the C language doesn't bother passing the size around with them, so it's very easy to pass an array of one size to a function that is expecting an array of another size. NASA found one of these cases in testing, and wanted to know whether there were any more instances of the same mistake lurking in the code. A short query, nine lines, was enough to find the other occurrences, quickly enough to be useful with the spacecraft already in flight. [The remainder of this passage is unintelligible in the recording.]
Now, NASA's code is written in C, but we built LGTM, and of course this is a Python conference, so the rest of this talk is about doing this kind of analysis for Python. [A passage here is unintelligible in the recording.] I would like to contrast code analysis with a couple of quality-assurance things you probably already do. The first one is testing. I'm going to assume everyone tests their code; I'm not going to bother asking, so as not to embarrass anyone. Testing is obviously very important for checking that your code is safe to be released, but testing is very specific to a code base. You need to write tests for everything you're interested in: if you write a new piece of code, you're going to have to write new tests for it. You can't rely on pre-existing tests, or pre-existing sets of tests, to find what you want, whereas with code analysis you often can rely on pre-existing stuff. Another quality-assurance practice you probably use is code review, and code analysis is much more like code review. A human code reviewer is going to look at the diff in your code and attempt to see what issues they can find with it. Hopefully they'll do it in a positive, constructive fashion, but essentially they are looking for flaws in your code. They might be looking for design flaws, but they're also looking for smaller-scale errors. Code analysis can take away a lot of that work. It can find those errors, and it's much more meticulous than any human could ever be. It can also work out where your changes interact with other pieces of code and double-check the interaction, which is something a human reviewer could very easily miss. So what makes for good code analysis? It needs to be flexible, it needs to be accurate, and it needs to be insightful; without those, it's not very useful. Let's go through these one by one. It needs to be flexible.
Given the NASA example, that was an error they had not anticipated. Had there been a general-purpose check they already had, that would have been fine, but because they didn't anticipate it, we needed to create a new analysis on the fly, reasonably quickly and reasonably easily. That's an important part of analysis. Another requirement is that it's accurate. After all, imagine you have a watch, and whenever you look at it, it's right half the time. What do you do with that watch? Well, you just bin it. What about if it's right 90% of the time? It's kind of useful, but not really that useful; you'd probably keep it if you didn't have any other means of checking the time. But what about a watch that is 99.5 or 99.8% accurate? Sure, you'd double-check it if you had a flight to catch, but otherwise you'd pretty much rely on it. Once you can rely on something, it just makes your life easier, because you're not double-checking it or doubting it all the time. Accuracy is very important. Finally, it needs to be useful; it needs to be insightful. PEP 8 is all very well and good. We all love PEP 8, but does it really matter if there are 81 characters in a line, or these minor little things? But if the analysis can find things like a cross-site scripting vulnerability, then that's really valuable. Obviously, there's a huge range of things that we can find, and you want the code analysis to find the interesting ones. So, can we do this for Python? The answer is yes; spoiler alert there. But it can be tricky compared with statically-typed languages. Python doesn't require type annotations or declarations. Of course, type annotations exist and are used, but there are relatively few of them. Also, because it has a history of being dynamically typed, people tend to just pass values around and then locally check for things: is something None? Does something have a particular attribute? Is it callable? And so on, before they do some operation on it.
And we need to understand those sorts of things. Also, people do things that are genuinely dynamic in Python, things that a code analysis tool is always going to struggle with, like creating classes from a database schema on the fly. That's pretty difficult to analyse. If you have the database schema to hand, then maybe you can integrate analysis of that, but generally you're not going to be able to do that. So, in order to keep things accurate, we also need to know what we don't know. I've said flexibility is important. In our tool, LGTM, what makes it flexible is that, at heart, it contains an object-oriented query language. The advantage of a query language is that it's declarative: you can just say, "I'm looking for this sort of problem." That allows you to write fairly brief queries that will find what you want. Given the NASA example, I said there was a nine-line query. I'm not going to bore you with that, because it uses the C libraries; I said we're going to focus on Python, so I will give you a Python example. Here's an example query. Basically, what we're looking for here is a for loop where the thing it's iterating over is not an iterable. The query is pretty short, and obviously, at first glance, it may not make a lot of sense, so I'll explain how it works. It has three clauses, much like any SQL query: a from clause, which describes the program elements we're interested in; a where clause, which relates them; and a select, which just gets us our result. In the from clause, we're looking for a for loop, an expression, a class, and an AST node. An AST node, at this point, we'll just say is some point in the program. We're interested in it so that we have a marker we can look at when we see a result, so we know what we're looking at and how to fix it. The key thing is that we're not looking for any old combination of those; we're looking for a specific relation between them.
The first relation is that the expression is the thing in the for loop; that's the first line in the where clause, basically saying that the iterable of the for loop is iter. The next one is probably the key point. It says that for that expression, whatever set of values it could hold, we don't care about the values themselves, but we are interested in the class of those values, and in the origin, so we know where the value came from when producing results. The last line says that it's not iterable; but note that we're not saying that we know it isn't iterable. We're saying that we don't know that it is iterable, which is why we have the second clause, which says that we do know something about it. I see a hand up. Sorry, the underscore is a convention meaning "ignore this value". I believe that follows SQL conventions, but I don't really use SQL, so anyone can correct me later if I'm wrong. We don't care about that value. Then we select the loop and the origin, which is a usually useful helper telling us where the value came from, so we can fix the error. So the flexibility is that you can write these brief queries, and you can write your own. What makes it accurate, or precise? That refers back to the previous slide: basically, all our analysis is wrapped up in the library, so I'm now going to go through the library. Let me check my time. That's good. I'm going to go through some examples, and I'm going to show some example code. It would be nice to show real code, but there are a couple of reasons not to. Real code is far too big; it won't fit on a slide. And what usually happens is that the cause of an error and the manifestation of an error are not in the same place, which makes it awkward to follow. There's another, equally important reason: I don't really want to choose some arbitrary piece of code, point at it, and say, look, there are bugs in your code.
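As an aside for the reader: the query above is written in LGTM's own query language, which the transcript doesn't reproduce, so here is a rough, hedged sketch of the same idea in plain Python, using only the standard ast module. The function name find_non_iterable_loops is mine, not LGTM's, and this toy version only catches the simplest case, a loop iterating directly over a numeric literal:

```python
import ast

def find_non_iterable_loops(source):
    """Toy sketch of the 'iterating over a non-iterable' check.

    Flags `for` loops whose iterable is a literal int or float,
    returning (line number, offending value) pairs.
    """
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.For):
            it = node.iter
            # The real analysis computes the set of classes any expression
            # could refer to; here we only handle numeric constants.
            if isinstance(it, ast.Constant) and isinstance(it.value, (int, float)):
                results.append((node.lineno, it.value))
    return results

# Caught: iterating directly over a constant.
print(find_non_iterable_loops("for n in 1:\n    pass\n"))   # [(1, 1)]

# Missed: the iterable is a *name*; finding this needs the data flow
# machinery described later in the talk.
print(find_non_iterable_loops("numbers = 1\nfor n in numbers:\n    print(n)\n"))  # []
```

The point of the second call is exactly why the talk goes on to build control-flow and data-flow analysis: a purely syntactic check cannot see what a name refers to.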
It's much better that the finger points at me, and that the code is clearly intentionally buggy, as you can probably guess from the name buggy_code there. Hopefully I won't upset anyone. The first piece of analysis we do is to parse the source code and produce an abstract syntax tree, which is basically a tree that describes the structure of the source code. The very simple code on the left produces the abstract syntax tree on the right. Assume this is a whole module, not just a snippet of code. In the tree on the right, the top level is the module itself, and it contains two statements: an assignment and a for loop. The assignment is broken down into the target of the assignment, the left-hand side, which is numbers; numbers is a name rather than a string, so it's a Name node holding numbers; and the value, which is just the value 1. This is an abstract syntax tree rather than what's called a concrete parse tree. A concrete parse tree would contain things like the parentheses around the 1, which we've omitted from the abstract syntax tree because they don't affect the meaning; they're just extra syntax. Likewise, there's no actual marker for the for or the in tokens; those are just omitted. The for loop is a little more complicated. The target, again, is the thing that gets assigned, which is n; the iter, which, if you recall loop.getIter() from our query, is the thing being iterated over; and then there's a body, which is just a list of statements, in this case one statement. That statement is an expression statement whose value is a call; the call calls a name, which is print, and has the argument n. The dots suggest there are a few other bits, to do with the presence or absence of star-args and star-star-args and annotations and so forth, which I'll skip over.
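For readers following along, you can poke at the same structure with CPython's own ast module; this is the standard library's representation rather than LGTM's, but the shape matches what the slide shows, dropped parentheses and all:

```python
import ast

source = """\
numbers = (1)
for n in numbers:
    print(n)
"""

tree = ast.parse(source)

# The module body contains two statements: an Assign and a For.
assign, loop = tree.body
print(type(assign).__name__)       # Assign
print(assign.targets[0].id)        # numbers
print(assign.value.value)          # 1  (the parentheses are gone: AST, not parse tree)
print(type(loop).__name__)         # For
print(loop.target.id)              # n
print(loop.iter.id)                # numbers
print(loop.body[0].value.func.id)  # print
```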
Okay, so if we take that piece of code and run it through our tools, it tells us that, indeed, we have an error there. This is a screenshot, because I didn't trust the Wi-Fi, but I think I should have... where is it? Alerts: non-iterable for loop, one result. Why is this not working? Right, okay. So this is actually on the web. If I click on the alert, it highlights the origin, which is where the origin part of the query comes in. So basically it can say that numbers is an integer, and you shouldn't iterate over an integer. Okay, so that's the AST; that's the first thing we do in our analysis. Let me go back to the presentation. Our next step is a control flow graph. If you look at the program on the left, you'll see that this one is somewhat redundant, but it's correct: the numbers we are iterating over this time is the tuple (1, 2, 3). Now, the AST would just have said we have two assignments to numbers, and it doesn't give us any information about ordering; or rather, it sort of gives us information about ordering, but not accurately enough to be generally useful. A control flow graph does. A control flow graph is basically a graph that emulates the way the interpreter actually executes the code. First of all, look at the octagonal elements for the module: there's an entry and an exit point for the whole flow graph. The orange one at the top is the entry, and the grey one near the bottom is the exit. Then execution simply flows through the code. Essentially, we evaluate the constant 1 and assign it to numbers; then we evaluate the constants 1, 2, 3, create a tuple, and assign that to numbers; and then we go through the for loop, which basically says: load numbers, then loop over its items, printing each one at a time. I hope that's reasonably clear.
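The evaluation order described here (evaluate the constant, assign it, build the tuple, assign it, then loop) is also exactly what CPython's bytecode makes explicit; as a quick standard-library illustration of the same ordering:

```python
import dis

# The same module as on the slide: two assignments to `numbers`,
# then a for loop over it.
code = (
    "numbers = 1\n"
    "numbers = (1, 2, 3)\n"
    "for n in numbers:\n"
    "    print(n)\n"
)

# The disassembly shows the interpreter's order of operations:
# load a constant, store it to `numbers`, load the tuple constant,
# store it, then set up and run the loop (FOR_ITER).
dis.dis(compile(code, "<example>", "exec"))
```

The exact opcodes vary between Python versions, but the ordering the control flow graph captures is the same.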
Yeah, Google Docs won't allow you to put SVGs in, so I'm not sure the resolution of the PNG is great. Okay. So that's good: we're no longer coming up with a false positive by thinking numbers is 1 when we should know it's the tuple. But things can get more complicated. In this case, if this were the whole code, obviously we would stop as soon as we hit random with a NameError, but let us assume that random is defined somewhere else as something that is either genuinely random or something the analysis can't work out to be true or false; it doesn't really make a lot of difference. Now, at the beginning of the second if statement, we know that numbers is either 1 or (1, 2, 3), so it's either an int or a tuple. If we track that through naively, we're going to see some errors: we're going to think that in either for loop numbers could be an int or a tuple, so there are errors in both loops. But of course there's only an error in the second loop, because the first loop is guarded by the check to see whether it's a tuple, so the 1 cannot reach it. The control flow graph doesn't really show us that, because the control flow merges. So what we need is what's called data flow. With data flow, we track the values, or the set of values, or some approximation to the exact values (obviously, since we can't execute the code), and that gives us the information we need. In this case, because these are simple constants, we just track those. I should mention, with apologies to anyone who is colour-blind, that the green arrows correspond to the case where the condition is true, and the blue arrows to the case where it's false. And we can track the values through.
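The code on that slide isn't reproduced in the transcript, but from the description it is presumably along these lines; treat the names and literals as a reconstruction:

```python
import random

def process(condition=None):
    # `condition` stands in for something the analysis cannot evaluate.
    if condition is None:
        condition = random.random() < 0.5
    numbers = 1
    if condition:
        numbers = (1, 2, 3)
    # At this merge point, an analysis only knows: numbers is an int OR a tuple.
    if isinstance(numbers, tuple):
        for n in numbers:   # safe: the guard means numbers is a tuple here
            print(n)
    else:
        for n in numbers:   # bug: here numbers can only be the int 1
            print(n)

process(condition=True)    # prints 1, 2, 3
# process(condition=False) raises TypeError: 'int' object is not iterable
```

A merge-point analysis that only knows numbers is "int or tuple" would flag both loops; tracking how each test filters the set of possible values flags only the second one.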
So basically, as we go through each branch, the test will eliminate one of the values on either side, such that in the first for loop we know the value 1 cannot get there. We track the value 1 from the assignment, and as we go through the test, when the test is true (1 is not a tuple), that value is discarded at that point, so there's no error in the first for loop. But the second for loop is guarded merely by the test that numbers isn't a tuple; 1 isn't a tuple, so it reaches the loop, and there's an error there. We can see that on our website: there's an error in the second loop but not in the first, because from the data flow we've been able to work out that we're seeing an integer in the second one but not in the first. But, of course, there are cases where that doesn't work. This is a slightly contrived case, but you do see things like this, where people do a test, conditionally set some value, then do the same test again later on and use the thing they set under the first test. This can happen with try imports as well: you might see code like try: import foo, then except ImportError: foo = None, and later on there's a test that says if foo: (because modules are always true) do something using foo. We need to track that. So here's the program, with the control flow graph on the right, and our data flow is insufficient to prevent a false positive here, because flag and numbers are different variables. All we know about numbers at the point we hit if flag is that it's 1 or the tuple; flag has nothing to do with that. So both values pass through that test, and we get a false positive. What we can do is split the control flow graph; basically, we apply this transformation.
So, basically, what we do is we don't rejoin the flow after the first if statement, and then we can duplicate the if flag test and move it into each branch. Then it fairly trivially falls out which loop gets run and which doesn't: if False and if True are pretty straightforward. And it's then fairly clear that on one side the loop is never going to be executed, and on the other side it's safely executed, because numbers is (1, 2, 3). Now, obviously we don't do this transformation on the source code, because that would mess up everything else; we do it on the control flow graph, and here's the transformation. The other thing to note here is that you might think this has a tendency to just blow things up horribly. It's actually not so bad. We do limit the amount of splitting we do, in order to avoid it blowing up. But because you split on cases where there are repeated tests, once you split, you can often then prune some of those branches. You will note that on the branch that follows the true case of the first test, we also know the outcome of the second test, and on the other side it's the other way around. Consequently, we're able to prune those extra branches, and you'll see the right-hand side is really no larger than the left-hand side. Sometimes it does expand; a particular case is where you have an early test, then a whole lot of code, then another test that matches, and then there tends to be some duplication. But generally we don't see much of an increase in the size of the control flow graphs with this. So far, so good. But you will have noticed that all of that was very localised, and data tends to flow around a program through calls and so on. So have a look at this code. Is this correct?
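Again the slide isn't in the transcript; from the description that follows, the code is presumably something like this sketch (the name print_numbers and the exact literals are reconstructed):

```python
def print_numbers(flag, numbers):
    # Summarising all callers together, flag is "True or False" and numbers
    # is "int or tuple", which looks as if it might iterate over an int.
    # Analysed per call site (with call context), neither call is an error.
    if flag:
        for n in numbers:
            print(n)

print_numbers(False, 1)          # flag is False: we never reach the loop
print_numbers(True, (1, 2, 3))   # prints 1, 2, 3
```

Run as-is, this never iterates over the integer; a context-insensitive summary of the function would mix the two call sites together and report a false positive.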
Well, yes it is, because we either call print_numbers with False and an integer, in which case flag is false and we don't loop, or with True and a tuple, in which case it's okay to loop. Now, if we just track the calls to print_numbers in the code at the bottom into the function, that's not quite sufficient, because what values can flag have inside the function? It could be True or False; numbers could be 1 or the tuple; and we're unable to distinguish the combinations. What we need is something called call context, where we basically pass along the context from which we call something, and that enables us to disambiguate. So, in order to be correct, either we need to not track any values through calls, or only track a very restricted set, or we can use call context. Again, call context is something that can potentially blow up the size of what we're analysing, so we need to limit it carefully, to find interesting stuff without hurting performance too much. And there's more, but we don't really have time; I could carry on like this all day. Hopefully I've convinced you that our analysis is reasonably accurate as a result of using all these techniques. Let me just check my notes. Cool. So now, a quick run over lgtm.com. Here's the front page for Django. We analyse a large number of open-source projects, not quite as many as we would like, so if yours isn't up there and you think it ought to be, come and talk to me; but generally we aim to analyse most of the popular stuff on GitHub and Bitbucket. I'll just quickly run through this, so you can see the number of contributors and the number of alerts. Some of those alerts are errors; others are just recommendations and warnings, which are probably less important. Then there are lines of code and a whole bunch of other stuff.
I'm picking Django because it's reasonably good code, so hopefully I'm not embarrassing anyone, and I think pretty much everyone's heard of it. Now, another project, one slightly less famous and of lower quality, because I want to highlight that we do pull-request integration. If you think this might be valuable for your project, you can look through the alerts, and if you want to know about these things every time you have a pull request, we do pull-request integration: all you need to do is log in to lgtm, get a user ID, go to your project, and click "pull request". Right, that's a bit badly formatted. Now, here's the interesting stuff. As I said, flexibility is key. Flexibility means that when you have a problem I haven't anticipated, and you would like to find other instances of that problem, you can write your own query. Now, this is a query language, a custom, declarative query language, and declarative programming is somewhat different to the sort of programming you might be used to. So I appreciate that, although these queries look very concise, it's sometimes not entirely straightforward to get your head around them, but I would recommend you have a go. Just to recap: good code analysis should be flexible, accurate, and give you valuable, insightful results. Insightful, in some ways, is up to you: if you need to know something about your code, and you can write a query to find it, then that's pretty insightful. Okay, I think that is it. There's an open space on code analysis; I want to invite anyone who's interested along to that. If you want to come and chat to me, I'll be around after the talk, and I'll be at the sprints. And if you thought that was all kind of cool: we're hiring. Does anyone end a talk without saying that? Oh, Larry's put his hand up. Okay, but we are hiring. If you think this is cool and you'd like to work on it, we're hiring for Python analysis, for the web front end, for the core infrastructure, and if you want
to do C++ or Java analysis or the like, we're also hiring for those. Okay, I think that's it. Are there any questions? We have something like 15 minutes for questions, and I already see some hands over there; I can come round with the microphone.

Question: I just tried it in parallel and tried to add my project, and it says it can't build it because some headers are missing. How is that handled if you have non-Python dependencies? It also did not find our requirements.txt, because it's not in the top directory; we have a requirements.d, and in there are the requirement files, because we don't have just one, we have multiple.

Okay, I think that's probably better discussed offline, but yeah, the problem is that we essentially just scan everything, and if it's pretty standard we build it; if it's not, we'll need a bit of custom configuration. I'll look into that, because we're always wanting to fix these issues.

Question: Static analysis is great, especially if you're doing C, C++, Java, any of those static languages; if you're programming in those and you aren't using static analysis, start using it yesterday, because it will find massive amounts of bugs that you otherwise wouldn't. But in Python, the problem is, I've used several of these tools, and what always seems to happen is that you enable them and, because Python is so dynamic, you get massive amounts of false positives; they just overwhelm you, and all of the valuable things get lost. So do you have any numbers or estimates on what your false-positive rate is?
Well, this is the accuracy thing; this is what I was saying earlier about knowing what you don't know. To put it technically, I would say that we attempt to ensure that our knowledge, the set of facts that we present to you and on which you can base your queries, is as large a strict subset of the truth as we can manage. With that in mind, there should in theory be no false positives at all. Obviously we're not perfect, so we do have some false positives, but we generally regard any false positive as a bug, with a handful of exceptions where doing the analysis completely, perfectly accurately would give us essentially no results, and we're prepared to trade a few false positives to get meaningful results on most projects. As for numbers, it's hard to say, because we don't know how many errors there are in a programme, so we can't come up with a figure for our false-positive rate. But we aim for zero; it's not an achievable goal, but it's definitely something we take pretty seriously.

Question: You have it as a service online; do you offer it as a self-hosted service, for example for companies that, for confidentiality reasons, can't put their code online?

The answer is basically: I don't do sales. My boss has told me off for trying to do sales, because I'm an engineer and I'm not very good at it, so don't ask me. This is all free for open source; if you want to use it commercially, then you should contact us directly through sales, at semmle.com.

Question: Now that there are those new type annotations in Python 3, do you also make use of them when you run the analysis?

Okay, so I didn't cover type annotations, and we don't currently take much information from them. Development is an ongoing process, and we focus on whatever gives us the best improvement at the time, so including type hints is definitely on the cards, and we particularly want to do it for analysing stub files for the standard library and so on, because our
standard-library analysis is currently just a one-off analysis we did of the C code, on which we then tried to write some queries to generate type information. That's relatively weak compared with what's now in the typing stubs, so we want to use those. Of course, the worry is that any error in the stub files will just manifest itself as a false positive, so that's definitely a concern. But yes, the plan is to take advantage of those: not to do type checking as such, but merely to use them as sources of information for the more general data flow.

Thank you. Any other questions? Question: Do you have examples of bugs that were fixed in open-source projects thanks to LGTM, or of projects that embraced the tool?

Not off the top of my head; I can dig out a list if you want some evidence later on. But going back a long time, we found a bug in the 2.7 standard library, which is a kind of funny little one. In 2.7, if you implement __eq__ but not __ne__, then the interpreter can give you inconsistent results for equality and inequality. I think it was a WeakSet in 2.7, a while ago, four or five years ago, before it was fixed: it was both equal and not equal to itself. We used that as a little demo, and unfortunately I think someone from the core developer team was at that demo, and they fixed it within an hour, so we had to find a different demo. And we've found things in various other projects. Interestingly, we had something that looked like a false positive in Flask, but it turned out that there was actually something wrong in the requirements for Flask: it said it needed click 2.0 or higher, and we were reporting an error saying a function was being called with an argument that didn't exist in the function being called; but when you corrected the requirement to click 4, the error went away. So sometimes you find rather indirect errors like that. Any other questions?
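For the curious: the __eq__/__ne__ quirk can't be reproduced directly in Python 3, which derives != from __eq__ automatically; this sketch fakes the old Python 2 behaviour by writing out what its default __ne__ effectively did:

```python
class Broken:
    """Mimics the Python 2.7 pitfall: __eq__ defined, __ne__ left as the
    old identity-based default. (Python 3 would normally derive __ne__
    from __eq__, so here the inconsistent __ne__ is written explicitly.)"""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Broken) and self.value == other.value

    def __ne__(self, other):
        # The bug: inconsistent with __eq__, like Python 2's default.
        return self is not other

a = Broken(1)
b = Broken(1)
print(a == b)   # True
print(a != b)   # True: both "equal" and "not equal" at once
```

This is the shape of inconsistency the WeakSet bug he mentions exhibited: two distinct but equal objects compare both equal and not equal.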
Question: Mark, great talk, thank you. I'm just wondering: on one of the slides you appeared to generate code that was also Python, but an improved version. Is it possible, if you put in a project, to spit out improved code that we can then use?

When you say improved code, do you mean the transformation on the flow graph? Well, I wouldn't say that's an improvement: it's got duplicate code in it, so it's definitely not an improvement. Analysis tools, much like compilers, are going to transform the code and do all sorts of weird things with it during their analysis, and some of the intermediate representations are going to be truly atrocious, incomprehensible code; but they make things more explicit as far as the analysis is concerned, internally. You should leave the code as it is. If there are errors in your code, fix them, obviously, and if things are unclear, make them clearer; but you shouldn't really rewrite code for the benefit of the tools.

Any other questions? Well, I do have a question, so I will ask away: can you give us an idea of how many projects your company is processing in general? And I'm also curious what kind of infrastructure you folks have to run your system on for all these projects.

There are some numbers. So, 50,094 projects, apparently; I hope that number's right. This is including Java, JavaScript and Python; probably roughly half of those, or maybe even more, will be JavaScript. I'm not sure of the ratios, but JavaScript is the largest number, then Python, then Java. I think that's just because the proportion of projects we can analyse at the moment is lower for Java, because of the build issues and so on; but that number should go up, and if we add languages it will definitely go up further.

Any other questions? Going once, going twice, going... Well, let's thank the speaker again. Thank you.