Okay, so we're going to start. Our first speaker I'm very excited about, because basically every paper that I read mentions his work. So it's pretty good. We have Miltus from Microsoft Research — a round of applause for him, please. Thanks, Rene. Thanks for waking up so early on a Sunday. So my name is Miltus. I'm a researcher at Microsoft Research in Cambridge, in the UK. Overall, I've been working on research at the intersection of machine learning, software engineering, and programming languages for the last maybe seven years. What I wanted to do today is give you a brief overview of some of the research that is coming up. These are not necessarily things that you will go home and use right away, but I think they give us an overview of where we see things going. Now, of course, the first question that comes to mind is: why can we even use machine learning for source code? What is it that allows us to use machine learning to capture aspects of source code? The answer is a property that some of us in the research community call bimodality. It's the notion that, yes, of course, you write your code to tell your computer — your GPU, your CPU — exactly what instructions to execute. But because you also write code for other people to read, understand, extend, maintain, and debug, you leave all kinds of hints: better variable names so that people can understand what a variable is trying to do, good method and function names, all these kinds of things. So essentially, source code has two audiences: the machine, of course, and us humans. And because it's so costly for us humans to read and understand code, we add a lot of, let's say, human-level information.
So, a lot of human information. These are, you can imagine, patterns both in the way we think and the way we write code, and these patterns end up in our source code. And because we have patterns, we can do machine learning. Maybe the poster child of machine learning for source code is code completion. In Eclipse, Visual Studio, and Visual Studio Code now, when you type code — when you do text-dot-something or path-dot-something — instead of just offering you a list of suggestions that are valid for this location, they learn about the context that you are currently typing in. They will say: well, given the current context of the code — say you just assigned something to a path, and now you are inside an if — probably you're trying to see whether the path starts with something or ends with something. So essentially we go on GitHub, we scrape all of GitHub, let's say, and we find the common patterns in how people write code. And you get this in the form of auto-completion where, instead of getting a long list of all possible options, you get them sorted by some probability — a probability based on how people have used code before. This is currently deployed in many places, and you can probably use it already. But there have been other things going on in the last maybe five or six years. One thing I wanted to show here is the idea of predicting types. JavaScript is very common nowadays, as you probably know. And there are a few pieces of work that essentially take an untyped snippet of code and try to predict, let's say, the most probable types for your code. You can imagine you have a for loop — maybe that's too small to see in the back — and you write for i = 0. Well, you probably know that i is an integer.
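A toy version of this frequency-ranked completion can be sketched in a few lines of Python. Everything here — the class name, the (receiver, member) corpus format — is made up for illustration; real systems condition on far richer context than just the receiver name:

```python
from collections import Counter, defaultdict

class ToyCompletionRanker:
    """Rank member completions by how often they follow a receiver in a corpus."""

    def __init__(self):
        # counts[receiver][member] = times `receiver.member` was seen in scraped code
        self.counts = defaultdict(Counter)

    def train(self, corpus):
        # corpus: iterable of (receiver, member) pairs mined from code,
        # e.g. ("path", "startswith") for every `path.startswith(...)` seen
        for receiver, member in corpus:
            self.counts[receiver][member] += 1

    def suggest(self, receiver, valid_members, k=3):
        # Keep only the type-correct members, then sort by observed frequency,
        # so the most probable completion comes first instead of alphabetical order
        seen = self.counts[receiver]
        ranked = sorted(valid_members, key=lambda m: seen[m], reverse=True)
        return ranked[:k]

corpus = [("path", "startswith")] * 5 + [("path", "endswith")] * 3 + [("path", "upper")]
ranker = ToyCompletionRanker()
ranker.train(corpus)
print(ranker.suggest("path", ["upper", "endswith", "startswith"]))
# ['startswith', 'endswith', 'upper'] -- most frequent first
```

The design point is only the ranking: the candidate list itself still comes from ordinary static analysis (what is valid at this location); the learned model just orders it.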
So you get these patterns that you can learn from data, and you can start predicting types so that you can start migrating, let's say, from JavaScript to TypeScript, to Flow, or to other typed equivalents. And this is interesting, because it means that machine learning is starting to understand the nuances of how we use code and what things mean. On the value of adding types, there is an interesting paper by some colleagues from UCL and Microsoft Research in the US, where they found that if you take a JavaScript program and manually add type annotations, about 15% of the bugs that were eventually caught somehow would have been caught earlier if you had had type annotations to start with. So you see, machine learning is moving the needle in cases where we have some form of ambiguity. We don't necessarily know what the code or the functions are doing in JavaScript, but we learn those kinds of hints statistically — their types — which we can use later to find bugs. We also had some other recent work — again, mostly research, not something you can necessarily apply in your everyday life — on the observation that as humans, when we write code, we often overuse our primitive types. You can have a string which represents a password and you define it as type string, and a JSON string that you also just define as another string, and so on. What this means is that we have a latent notion — we don't write it down explicitly, it's hidden in our minds — that we probably shouldn't assign a JSON string to a variable that is called password. That would seem odd. So one thing we looked at is: how can we use the concepts, the names of the variables, the names of the methods, the structure of the program, to go towards splitting those primitive types apart and eventually catching bugs?
We're not there yet, but I think this again points out that this natural-language information — the names of the variables, for example — is very useful for doing the program analysis that we care about, and for understanding code in a way that lets us help developers semi-automatically. There's also an interesting tool from Google. As far as I understand, they use it internally. What they want to do is detect argument swaps. You have a function like Java's or C#'s substring: you write string.substring, parenthesis, and then first the offset and then the length — or maybe it's the other way around. When the function is defined, the formal parameters are, say, length and offset. And then you pass in variables called off and size, and you've just swapped them. In one of the real Google code examples they show in their paper, the function declaration says that the first argument is response, the second one is frequency, the third one is some list — let's not bother with that. But when the developer invoked this function, they passed frequency first and response second. The types, again, are the same, so the type system says, yeah, that seems fine. But there is an inconsistency between the formal and the actual parameters, and essentially these were swapped. So again, more and more information comes from natural language, from these soft aspects of source code, and this is where machine learning comes in. You can look at the paper at this location. So overall, I think the broad question is: where do we see machine learning for source code starting to appear?
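The core of that check can be sketched with a simple name-similarity heuristic: if swapping two same-typed arguments makes the actual argument names match the formal parameter names much better, the call site is suspicious. This is my own simplification — the difflib-based similarity and the 0.3 margin are assumptions, not what Google's tool actually does:

```python
import difflib

def name_similarity(a, b):
    # Crude lexical similarity between an argument name and a parameter name
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def swap_suspicion(formals, actuals):
    """Flag argument pairs where swapping them would make the actual names
    match the formal parameter names much better than the current order."""
    issues = []
    for i in range(len(formals)):
        for j in range(i + 1, len(formals)):
            current = (name_similarity(actuals[i], formals[i])
                       + name_similarity(actuals[j], formals[j]))
            swapped = (name_similarity(actuals[j], formals[i])
                       + name_similarity(actuals[i], formals[j]))
            if swapped > current + 0.3:  # hypothetical margin threshold
                issues.append((i, j))
    return issues

# Declaration: substring(offset, length); the call passes (size, off) -- likely swapped
print(swap_suspicion(["offset", "length"], ["size", "off"]))   # [(0, 1)]
print(swap_suspicion(["offset", "length"], ["off", "len"]))    # [] -- consistent order
```

A learned model replaces the hand-tuned similarity and threshold with statistics over how names co-occur in a large corpus, which is exactly where the machine learning comes in.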
There are many cases where we want to infer the latent intent of the user. That's the auto-completion case. We don't want the user to tell us, well, I'm typing here, and what I actually want to do is this, so now help me. Their intent is latent — it's hidden behind the keyboard — and we want a machine learning model that infers it. What does the user want to do? Or, in the case of finding bugs, what did the user want to do? There are other cases where we have ambiguous information. This is most commonly natural language, which is by definition ambiguous. When we want to take information in an ambiguous form and understand it in some way, that is where we want to use machine learning. And then, of course, there are the heuristics in our code that a lot of us use because we have to. In many cases, we want to replace those with a machine learning component that is adaptive, that learns as we go. In a more academic sense, I've written a survey of the research that has been happening, and there's also a great awesome-list website on GitHub with links. But this is a very broad perspective of where research stands. So I want to spend the next maybe 15 to 20 minutes going deeper into one specific piece of work we did at Microsoft Research, which is called detecting variable misuse bugs. The high-level idea is the following. Let's take this example. This is from an open-source database system in C#. What I'm going to do — and this is the same game I'm going to play with my machine learning algorithm — is blank out this variable usage here. Now, you can read the code, maybe not in the back, but the idea is that you define this variable, class, and somehow get its value, then you assert it's not null; then you define the variable first, get it again somehow, and assert that it is not null. And you continue your unit test.
Now, you would probably say that first is what should go in here, because that's probably what the developer intended. But presumably they got here, took the first two lines, pasted them over those two lines, changed a few things on the first line, and then forgot to change the second line. So this is something very odd — a real bug that our system actually caught in source code. You might say, yes, I can use a linter to do this, and yes, of course, you can create a linter rule that says: well, I have assert-not-null, and I haven't changed class again, so I shouldn't be checking this. But the problem is that there are so many rules one would need to maintain to do this at such a fine grain that it would be very hard. And maybe the specific case only appears again after another million lines of code. That's not really good. You could say you can write formal verification — something programming-language research has done a lot: mathematical formulas that verify things. Or you can run unit tests, but no one tests the tests. So in that sense, we need machine learning to catch these odd mistakes. Once you point one out, it's obvious. But in many cases, when you're writing code, you're stuck in this mode and cannot see some things — things that, once you find them, make you say: yes, I spent my whole day on this, obviously i should have been j. These things take time. So can we use machine learning to improve this? The idea is the following. We start by blanking out this variable — we're in C# land, though you can imagine doing this in other languages too, of course. We're going to ask: given this location, from all the variables that are in scope at this location and type-correct, so the type system would not complain, which one should we place here?
In this case, there are only these two options, so our system needs to pick among them. And as I said earlier, this is not something that is easy to catch with a static analysis tool. Here's another example from when we dogfooded things within Microsoft, within the IntelliCode program. You see another similar example: you create a rectangle from some coordinates. You pass X1, Y1, and then the height and width. Well, it's X2 minus X1, then Y2 minus X1 again — that seems off. So again, you can get this kind of analysis and try to get it in front of developers. Yes, once you see it, it's obvious, but in some cases it's not. So now the question — and I'm going to go slightly deeper into the machine learning aspects here — is: how do we attack this problem? If you look at this simple snippet of code, the first approach researchers tried is: let's treat this as natural language. Let's say this is a series of tokens, just one big long sequence, and from that let's try to apply some standard machine learning methods. The problem with that is that you lose a lot of context. You lose, for example, that this if statement is within this for statement, or that there is this variable i which is used to iterate. So you lose a lot of the structure. You may also want to go further. Compared to natural language — where NLP, natural language processing, has used machine learning quite a lot — parsing code is unambiguous in most cases. So we can parse the code, create a tree, and get the abstract syntax tree out of it. But again, we're still missing something. At the end of the day, what we want to do is exploit the very rich structure that we have — things like data flow. Data flow is very informative: we have i, which flows from here, to here, to here, to here.
And then maybe, once we do another iteration of the loop, we go back, and so on. So the idea here is to encode a program as a graph and then use a relatively new machine learning component called graph neural networks, which can process and understand graphs. So let's discuss how we get to those graphs. As with many things in machine learning, and especially in deep learning, these are design choices. The graph I'm going to describe reflects the design choices we made. That doesn't mean they are the unique or best choices; it means that in this design space, we picked this point. So let's start constructing a graph for this very, very simple snippet of code from the previous slide. First of all, these are my tokens. I can connect them into a boring chain — assert dot not-null, like here — with a special type of edge called NextToken. That's a bit boring. We can also parse things — and yes, that's unambiguous — so we can create the tree and connect all the nodes through an extra type of edge called, for example, AST child. Now we have encoded the syntactic parts of the program in the graph, but we are still missing many, many things about the semantics of the code. So forget about these edges — they will remain in the graph, but I will stop showing them on the next slides — and let's go to a slightly simpler example. It doesn't do anything real, but it has a loop, and it will help us describe how to construct these semantic features within our graphs. The first thing is that we can add an extra type of edge — the previous ones are still included here — called last write. Given a specific position within our program, like this Y: when was the last time that Y got written? In this case, Y is just here; it was written just once. But if you are on X, for example, well, the last time X was written depends on where you are.
If you are here, the last time you wrote X could be just here, if you just entered the loop; but if you're still looping, it was the previous time you were here. So you encode this information within the graph — again, just one way to encode it. The same goes for last use: when was X, for example, last used in my program? Same thing. Take Y again, the simplest one: if we just entered the loop, it was this instance here, but if we are looping, it's itself. So we can construct more and more complex graphs and add more semantic edges, like computed-from — you can imagine adding more of those. What happens is that, at the end of the day, we have encoded in our graph as much information about our source code as we think we care about. If we draw the graph for this example, it would look something like this. This is not meant for you to read, but you see it already becomes quite complicated. So there are a lot of things to parse, and what we hope to do is use machine learning to answer the variable misuse problem I was discussing earlier. I'll get to how we do it on the next slide, but on average — and this is a very simple example — our graphs, where each graph is a single example, a single piece of code, a single variable misuse instance, have about 900 nodes and about 8,000 edges per graph. So the graphs are not huge — they're not like the Facebook graph of a billion people — but they're not small either. This is the problem setting we're in, and this is how we try to encode programs as graphs. Of course, the goal is to take these graphs, push them into a machine learning component, which I will discuss next, and then magically get our answer.
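A minimal sketch of such a program graph, with the edge types from the talk (NextToken, Child, LastWrite, LastUse). The snippet, node ids, and hand-placed semantic edges are purely illustrative; a real pipeline derives them from the parser and data-flow analysis, not by hand:

```python
# Toy program graph for the snippet:  x = 0  while x < 10 :  y = x
# Nodes are id -> label; edges are (src, edge_type, dst) triples.

class ProgramGraph:
    def __init__(self):
        self.nodes, self.edges = {}, []

    def add_node(self, nid, label):
        self.nodes[nid] = label

    def add_edge(self, src, etype, dst):
        self.edges.append((src, etype, dst))

g = ProgramGraph()
tokens = ["x", "=", "0", "while", "x", "<", "10", ":", "y", "=", "x"]
for i, tok in enumerate(tokens):
    g.add_node(i, tok)
for i in range(len(tokens) - 1):       # the "boring chain" of syntax
    g.add_edge(i, "NextToken", i + 1)

# A couple of (partial) AST nodes connected by Child edges
g.add_node("assign1", "Assign")
g.add_node("while", "While")
g.add_edge("assign1", "Child", 0)
g.add_edge("while", "Child", 4)

# Semantic edges: each variable use points back to where it was last written/used
g.add_edge(4, "LastWrite", 0)   # x in the condition was last written at `x = 0`
g.add_edge(10, "LastUse", 4)    # x in `y = x` was last used in the condition

print(len(g.nodes), len(g.edges))  # 13 nodes, 14 edges for even this tiny snippet
```

Even for a one-line loop the edge count grows quickly, which is how real examples end up with hundreds of nodes and thousands of edges.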
So how do we encode, specifically, the variable misuse problem? Other problems can be encoded in different ways. We had this problem here: we blanked out this variable, so we want to create one graph for this case. What we're going to do is — well, we need to predict class, but that's not necessarily the point here — we replace whatever token was here, syntactically, with a slot node placed at this location. So now, magically, we have removed any information about what variable was here. We don't know what was originally here; that is going to be our task — let's try to predict it. And we're going to create extra edges and nodes for what we call candidate symbols. For everything that could be in scope and type-correct at the slot location, we create one node and connect it back to the rest of the graph. In this case we have two candidates, first and class, and we do a speculative data flow analysis. We ask: if first were in the slot, how would data flow around it? If class were in the slot, how would data flow around it? And we connect everything accordingly. This gives us, essentially, an objective. What we want to do is learn something called a representation in machine learning — a distributed vector representation; I'll get to that in the next five minutes — such that the representation of the correct variable, first, is as close as possible to the representation of the slot, and the representation of class is as far away as possible. This is what we're trying to do with machine learning. So this is the problem setting: our data is graphs. Now the question is, what's in the machine learning toolkit that can help us with this problem? In the next few minutes I'll give an overview of these machine learning components.
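The "close to the correct candidate, far from the wrong one" objective can be sketched as a dot-product score over candidates with a softmax on top. The 4-dimensional vectors here are made-up stand-ins for what the graph neural network would actually produce:

```python
import numpy as np

def candidate_scores(slot_repr, candidate_reprs):
    # Score each candidate symbol by the dot product between its learned
    # representation and the slot's representation, then normalize with softmax.
    logits = np.array([c @ slot_repr for c in candidate_reprs])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical representations for the example from the talk
slot   = np.array([0.9, 0.1, 0.0, 0.3])
first  = np.array([1.0, 0.0, 0.1, 0.2])   # correct candidate: near the slot
class_ = np.array([-0.5, 0.8, 0.9, 0.0])  # wrong candidate: far from the slot

probs = candidate_scores(slot, [first, class_])
print(probs.argmax())  # 0 -> `first` is chosen

# Training minimizes cross-entropy, pushing the correct candidate's
# probability towards 1 and thereby pulling its vector towards the slot's:
loss = -np.log(probs[0])
```

Gradient descent on this loss is what shapes the representations so that, at a buggy location, the variable actually written in the code gets a low probability — which is exactly the bug signal.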
I won't go into great detail because there isn't sufficient time. In the beginning of machine learning, people had this idea of local representations. You have a huge vector where everything is zero except one component, which is one. This helps us discriminate: if the first element is one, then our item is a banana; if another one is, it's a mango, and so on and so forth. But this is not necessarily the most efficient way of learning. So with machine learning — with deep learning, specifically — we've moved towards distributed representations: representations that are learned across our data. The idea is that we have a much smaller dimension, but these vectors are real-valued vectors in a d-dimensional space. What they do is encode, in each of their components, some of the item's attributes. So the meaning is distributed across the components, whereas before it was localized. We can get these distributed representations for anything — maybe you have heard of word2vec; word2vec is one of the methods for learning distributed representations. So now on to graph neural networks. This is essentially the core component that allows us to use graphs with machine learning. At a very high level, in a graph neural network you have a graph representation of your problem — what it is, is up to you. You also have an initial set of information about each node — local information about node A, B, or C — and these are the distributed vector representations I showed you earlier. And by the end of whatever the graph neural network does, what I want is representations for those nodes that carry information not just about each node in isolation, but about how it fits within the broader graph. So how is this done? It's done through something called neural message passing. The idea is: let's take this part of the graph here, where we have F, D, and E.
You have their representations over here. The idea is that F has a current representation — its current state, if you wish. As input, it gets messages from its neighbors. We combine them somehow — it doesn't matter how at this point — and update the current representation. So, as a single node, you receive information from your direct neighbors and update your own state. That's a graph neural network. Here it is with slightly more concrete equations — I don't think we have time to go through them. The main component is this GRU; this is a type of recurrent neural network that is about updating a state with incoming information. So this is a graph. At the first time step, my node here has just received information from its distance-1 neighborhood — and the same for every other node, because all nodes send and receive messages synchronously in this version of graph neural networks. At the next step, this broadens. Now my node has received another message from its neighbor, but that neighbor had already received messages from its own neighbors, so now my node has information about its distance-2 neighborhood. In that sense, we get more and more contextual information as we keep repeating this neural message passing algorithm. Another way of viewing it is to unroll the graph through time steps. What happens is that D and E pass messages to F; they may also receive messages; and this keeps repeating again and again in time. The idea is that, at the end of the day, the output of the graph neural network is a representation for each node — a distributed vector representation. And you can do anything with this. People have used it for many things, like picking which of those nodes to select — for example, that's similar to our problem.
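The synchronous message-passing loop can be sketched in NumPy. This is a deliberately simplified update — one shared message matrix and a tanh — rather than the per-edge-type matrices and GRU update used in the actual model:

```python
import numpy as np

def gnn_propagate(node_states, edges, W_msg, steps=2):
    """Synchronous neural message passing over a directed graph.
    node_states: (N, d) initial per-node representations.
    edges: list of (src, dst) pairs.
    W_msg: (d, d) message transformation (random here; learned in practice)."""
    h = node_states.copy()
    for _ in range(steps):
        msgs = np.zeros_like(h)
        for src, dst in edges:      # every node sends to its neighbors at once
            msgs[dst] += h[src] @ W_msg
        h = np.tanh(h + msgs)       # simplified state update (a GRU in the paper)
    return h

rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 8))               # 4 nodes with 8-d initial states
edges = [(0, 1), (1, 2), (2, 3), (3, 1)]   # includes a cycle, like a loop in code
W = rng.normal(size=(8, 8)) * 0.1
out = gnn_propagate(h0, edges, W, steps=3)
print(out.shape)                           # (4, 8): one contextual vector per node
```

After k steps, each node's vector reflects its distance-k neighborhood — which is exactly the "broadening" described above.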
Or node classification: should this node have a label, should it be blue or red, things like that. Or aggregating — averaging everything, or summarizing it somehow — and asking: does this graph have property A or B? This has been used a lot in chemistry for classifying molecules, and molecules can be described as graphs. Going back to variable misuse, the idea is that we go on GitHub, scrape C# code, blank out each variable, and ask which of the in-scope, type-correct options we should use — assuming there is more than one. What happens is that we get an accuracy of about 85% on projects where we've trained on a portion of them and tested on another portion. But we also generalize quite well compared to some baselines. So here is a snippet of code, a real snippet of code, and this is the task our model gets. We have blanked out this variable, and now the question is: of the variables that are in scope and type-correct, which one should we use? Here there are three string variables: baseDirectory, fullPath, and path. Now, because it's fairly early in the morning, I won't ask you to think about this, as I usually ask people. If you do think about it, it's not too hard: you need to reason a bit about the data flow, how things are used, and then you come up with fullPath. And indeed, that's what our algorithm also says: fullPath should be here. It does that in a few milliseconds, though. So in that sense, that's something that gets us closer to understanding source code. This is essentially how our model works in many cases — and of course it makes mistakes in some cases. So we dogfooded this internally within Microsoft, within the IntelliCode program, and this is a real bug that we caught. Someone was logging something, but here they decided to use — added existing document, really? No, you wanted added new document here.
So you get these kinds of errors that people make, and in many cases catching them is a good thing. Overall, we learned a lot of lessons — that's my next slide — but we've decided to discontinue the dogfooding that had been happening for the past year. The main thing is that, first of all, we haven't solved the user experience. We haven't solved how to communicate to developers who maybe don't have any experience with machine learning that our decisions are, first of all, probabilistic. As you saw here, we had 92% confidence, but it's never 100%. It never will be with machine learning. And at the same time, it's really hard to tell people: you're wrong. In that sense, we need to be better as a community at finding ways to say this. Then there are the questions of false positives. How do you explain false positives? How do you get users to accept that in some cases there are false positives — that's fine, it's up to you to judge? And of course, developers don't want many false positives. The other thing is machine learning capabilities. Yes, in this local setting of reasoning about code, we get a lot. But it's not something you can do at a much, much larger scale. You cannot take your full program — your few billion lines of code — use machine learning, and magically understand how all those billions of lines contribute to a specific point in your program. I don't think we as machine learning researchers yet have a way to understand and distill the special form that source code data has — the structure, the size, all these things. These are open questions that are still bothering us. Then there are metrics, and I'll go into this a bit on the next slide. In machine learning, you have a loss function — something you're trying to optimize. You want to be as accurate as possible at predicting X; you want to have the minimal loss over something.
In software engineering, we don't always have that. How do you measure the quality of a project? Yes, there are some metrics, but these are very high-level things. How do you use machine learning here? How do you measure things so that you can optimize them with machine learning? Again, that's another big, open problem. Finally, we live in a low-resource world. Yes, there are billions of lines of code, but you have maybe the source code of 10 operating systems. It's not like ImageNet, where you have hundreds of thousands of images and can use them to generalize to the next image. We have just 10 operating systems, or let's say the code of 10 database systems, and we need to generalize to the next one. That's already a problem. And this number 10 is effectively even smaller, because as software engineers we try to be smart and reuse things, so that's a problem. Going back to the learning signals, which I think is the main question here: we are mostly used to supervised learning. We have some input data, we have our model — the spherical cow of our problem — and we try to make a target prediction. We train on input-output examples X and Y and try to minimize some loss — essentially, something that says: I want my predictions to match my real data as closely as possible. It doesn't seem that we always have this in software engineering. This is an interesting question, an interesting challenge for researchers and practitioners. How can we tweak our systems? What do we need to change? How do we measure things? Instead of a conclusion, I'll just tell you that I think the promise here is that machine learning can help us create tools that help developers by removing some small, maybe boring, slow tasks, allowing developers to focus on the actual software products they are trying to build.
I think the idea is that we need to think of this as adding another virtual member to our team — one that, initially with baby steps, will start pushing us forward towards building greater software. So there is a lot of work to be done, but I think it's a very exciting area to be in. Again, thank you all for being here this early in the morning. Questions? So the question was, first of all, how do we initialize the information within the nodes of the graph neural network, and whether it is, let's say, robust to removing bits and pieces of the graph. In our case, the way we initialize the graph neural network is that we have natural-language information, like the variable name, or the name of a type, or the fact that a node is a dot, or something like that. So we essentially learn distributed vector representations for all possible names, all possible bits and pieces of whatever a node represents. In C# land, we add type information because we have it; in a Python or JavaScript land, you can imagine, we wouldn't add that. So there is a lot of flexibility; this was our design choice. Now, with regard to the other question: in practice, if you remove things that are far away from the point you care about — because in this case we care about a single point, a single place — yes, it will be fairly robust. Overall, there's a lot of research on adversarial attacks on graph neural networks. There are ways to trick them, like most neural networks. So their robustness varies depending on the application, the data you have, the training method, and so on. So the next question was why we discontinued the variable misuse dogfooding I mentioned. The first thing is the question of explainability. We cannot necessarily explain to a hundred percent of developers why we made a suggestion. Even for a wrong suggestion, in many cases there is a good reason it was suggested.
But I know that because I developed the algorithm and I understand its internals. Most software engineers won't understand it, or won't be bothered to read my whole paper to understand why one single code review comment was made. So I think that's the main problem. So, we parse the code as an abstract syntax tree — oh, let me repeat the question first. Yes, sorry. The question is whether we use just the parse tree, or also the natural-language information, within our graph neural network. The graph itself represents all the structural information that is given deterministically: AST, data flow, control flow, all these kinds of things. But the nodes carry information like the names of variables, the names of methods, things like that. This is essentially how we embed natural-language information. We don't add things like comments — we don't add anything else — but variable names are very indicative. So that's the only part of the natural-language aspects we use. Well, I think we don't have time.