Hi, welcome to Analyzing Software Using Deep Learning. This is part two in this module on using sequence-to-sequence architectures for analyzing software. In particular, we will now look into one application of these sequence-to-sequence models, namely API usage recommendations. So in this application, the idea is that we have a natural language query that describes how you want to use an API, or maybe more specifically, what you want to do with an API. And then what the sequence-to-sequence model is supposed to tell you is a sequence of API methods that you may want to call. All of this is based on a paper published by Gu et al. in 2016. So if you're interested in more details, and there are many more details than what I'm covering here, then please have a look at this paper. So let's start with the motivation for this work. Why do we actually want to predict how to use an API? The reason is that using an API can be pretty difficult. If you've ever used a larger library or framework, you've probably come across this problem. There are many, many classes and methods, and it's not always clear what methods you need to call in order to achieve a specific goal. And even if you know the methods to call, it may not be clear in what order to call them so that you're using the API correctly. Now, there are many different ways developers can seek answers to these questions. One of them is to go to some web forum, say Stack Overflow, ask the question, and then maybe get an answer from some other human who knows the API better than you and can help you. Or maybe there's already some answer that someone else has given in the past and that more or less matches your question. Here, the goal is to not rely on humans to provide these answers, but to automatically suggest API usages for a given natural language query, so basically for a given formulation of one of these questions that you see above. And the idea is to automatically suggest these API usages using a neural model, specifically a sequence-to-sequence model that has been trained on some data. So how can we automatically predict API usages for a natural language query? In this work, the idea is to formulate the problem as a translation problem, where the translation is from some input sequence to some output sequence. The input sequence here is a sequence of natural language words, so a description of what a developer wants to do with an API. And the output sequence is a sequence of API method calls that basically tells you: hey, you have to call these methods in this order, first call this, then call that, and eventually call that other method. In order to make these predictions, the approach trains and then queries a sequence-to-sequence neural network using examples of input and output sequences for exactly this purpose. So let's have a look at a concrete example. The natural language query that is given as an input to this model could be the following, where we are basically saying: hey, I want to match regular expressions. I do not really know how to do it, but this is what I want to do, and I can say it in natural language. One possible answer to that query would be this sequence of API calls, where, this is for Java, we would call Pattern.compile, then Pattern.matcher to get a Matcher, and then Matcher.group to retrieve the part of a string that matches the given regular expression.
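To make this concrete, here is a minimal sketch of what these three predicted calls look like in working Java code. The concrete pattern and input string are my own, and note that in real code you typically also need a call to Matcher.find (or Matcher.matches) before Matcher.group, which the predicted sequence does not mention:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        // The three predicted API calls, in order:
        Pattern pattern = Pattern.compile("(\\d{4})-(\\d{2})");  // 1. Pattern.compile
        Matcher matcher = pattern.matcher("report-2016-05.txt"); // 2. Pattern.matcher
        if (matcher.find()) {                                    // needed before group()
            System.out.println(matcher.group(1));                // 3. Matcher.group, prints "2016"
        }
    }
}
```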
And knowing these three API calls that are likely to be what you want if you want to match regular expressions, a developer has a much easier time using this API than if the developer had to search manually through all the APIs, or maybe search the web for examples. One important question for basically every deep learning based approach is: what is the training data? So let's have a look at this question for this particular application of sequence-to-sequence models. Here, the training data are Java projects that the authors of this work got from GitHub, so basically a lot of open source code. The API that this work is focusing on is the JDK, so the APIs of the Java Standard Library. And the reason for focusing on these APIs is that basically every Java project uses some part of the JDK, so there will be a lot of examples of how to use these APIs. In total, the authors analyzed 443,000 Java projects, so a very large number, giving them a lot of examples. And by looking at the API usages, specifically the usages of the JDK in all these projects, the authors extracted pairs of annotations and call sequences. An annotation is basically some natural language words related to what a piece of code is supposed to do (we'll have a look at how this works in a second), and a call sequence is a sequence of API methods that are called, as described by these natural language words. In total, the authors extract 7 million such pairs and then use almost all of them for training, but keep 10,000 for testing, to basically evaluate how well this approach works. To make this more concrete, let's have a look at a concrete example to see how this data extraction really works. What you see here is a piece of Java code: a method that was extracted from one of these open source projects on GitHub. This method happens to have a comment that describes what the method is actually doing. It says that this method is about copying some bytes from a large input stream to an output stream, and then it says a few other things. If you look at the code, you see a couple of calls to APIs in the JDK. Specifically, there's this call to read here and then this call to write here. These two happen to be API calls: by looking at the types of the objects on which these methods are called, you can figure out that they are actually calls to the Java standard library. So what the approach will extract from this example is the following pair of an annotation and a call sequence. The annotation is "copy bytes from a large input stream to an output stream", which is essentially the words in the first sentence of this piece of documentation. And the call sequence is InputStream.read and OutputStream.write, so basically the calls that we see in this code, augmented with a little bit of additional information, namely the receiver types, which do not appear as tokens in the code but can be found by resolving the types that are used in this API usage. Now that you've seen this concrete example, let's look in some more detail at how this extraction of data really works. Let's start with the extraction of the annotations, so the natural language part of the pairs of natural language and API sequences that we want to have. In order to extract the annotations, the approach looks at the Javadoc comments associated with each method.
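To illustrate, here is a minimal sketch of the kind of method this extraction runs on. It is my own reconstruction, loosely modeled on the copy example just described, not the exact code from the paper:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyExample {
    /**
     * Copy bytes from a large input stream to an output stream.
     * This method uses an internal buffer of a fixed size.
     */
    public static long copyLarge(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[4096];
        long count = 0;
        int n;
        while ((n = input.read(buffer)) != -1) { // JDK call: InputStream.read
            output.write(buffer, 0, n);          // JDK call: OutputStream.write
            count += n;
        }
        return count;
    }
}
```

From such a method, the approach would extract the pair: annotation "copy bytes from a large input stream to an output stream", call sequence InputStream.read, OutputStream.write.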
And as a heuristic to get the most important part of this documentation, it always extracts the first sentence of the documentation. So if there are multiple sentences, like in the example that we've just seen, then all but the first are ignored and only the words in the first sentence are considered. Methods that do not have any Javadoc are simply ignored, because there's nothing the approach can extract from them. But fortunately, there are enough methods that come with some documentation and that can be used here. As another heuristic, the authors ignore what they call irregular comments, so things like "TODO" or "FIXME", which basically tell the developer something that may be relevant, but is not really a description of what is currently happening in the implementation of the method. So now you know how to get the natural language part of these pairs of annotations and API usages. Let's now have a look at the API usage part and how this is extracted from the given Java code. The overall goal here is to have a relatively lightweight analysis, because the idea is to scale this to millions of code files in order to get the amount of data that we've already seen. To reach this goal, the authors use a static, AST-based analysis that also resolves some type bindings, so it basically knows what types particular program elements have. To illustrate this idea, let's first look at an example, which is this little piece of code here: a call list.add(23). Parsing this into an AST would give something like this for Java, where we see the different parts of this code represented as nodes in the abstract syntax tree. At the top, we have an expression statement node, because this whole list.add(23) is a statement which also happens to be an expression, so it's an expression statement. In there, we have a method invocation, which consists of a name, which happens to be list, another name, which is the name of the called method, which happens to be add, and then also a list of arguments, which here happens to contain just one element, the integer literal 23. Given such an AST, the approach extracts a couple of sequences of API calls, depending on what kind of code is found in the AST. If there is a constructor call, so something like new C() in the code, then what the AST-based analysis extracts is C.new, if C is a JDK class. So basically it resolves the type of C and checks whether this is a class of the Java standard library. If yes, because this is a constructor call, it represents this as a call of the "new" method of this class C. For a regular method call, something like o.m(), the analysis extracts C.m, again only if o is actually an instance of a JDK class C, which then basically tells us that on this class C, method m is called. So we do not really care about the name of this variable o, but what we do care about is that method m of class C is called. Sometimes you have code that calls one method and passes the result of another method as an argument, like in this example here, where the result of o2.m2() is passed as an argument to o1.m1(). What actually happens here at runtime is that first C2.m2 is called, and then C1.m1 is called. And this is exactly what the approach will extract here, the sequence C2.m2 followed by C1.m1, basically telling us that these two calls happen one after another. Again, this is only done if C1 and C2 are actually classes of the JDK.
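As a small, hand-made illustration of the rules so far (the code and the file name are my own, and the snippet assumes a file data.txt exists), consider the following, which combines the constructor rule, the nested-call rule, and regular method calls:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ExtractionExample {
    public static void main(String[] args) throws IOException {
        // The inner constructor call is evaluated first, so the extracted
        // sequence follows evaluation order, not textual order.
        BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
        String line = reader.readLine();
        reader.close();
        System.out.println(line);
        // Extracted call sequence:
        //   FileReader.new  BufferedReader.new  BufferedReader.readLine  BufferedReader.close
    }
}
```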
If there happens to be a sequence of statements, so something like this, where we have one call o1.m1() followed by another call o2.m2(), then, very similar to the example that we've just seen, the approach will extract these two calls as a sequence: C1.m1, where C1 is the class of which o1 is an instance, and C2.m2, where C2 is the type of o2. For conditionals, so something like this, the approach does the following. Let's say there's a conditional where some method m1 is called in the condition, and then, depending on which branch we take, either m2 or m3 is called. What the approach does here is to extract a single sequence: C1.m1, where again C1 is the type of o1, followed by C2.m2 and then C3.m3. Now, an alternative way of doing this would have been to extract two sequences, one for the then branch, which would contain C1.m1 and C2.m2, and one for the else branch, which would contain C1.m1 and C3.m3. The authors in this case chose to extract all three calls at once, which is not really what happens at runtime, but in a sense tells you in which order the developer is likely to write down these different calls. So for actually writing down the API usage, this may also be a reasonable choice. And finally, for loops, let's have a look at this one. Here we have a loop where m1 is called in the condition of the loop, and then in the body of the loop, we have a call to m2. What the approach extracts here are these two calls, C1.m1 and C2.m2, where again C1 and C2 are the types of o1 and o2. So overall, by doing all this AST-based extraction, what the approach gets is a lot of sequences of API calls extracted from the source code, and these come paired with the natural language words that we get from the Javadoc comments. Using these pairs, we can now train a model that predicts the API usages for a given natural language query. So now you know the individual parts of this overall approach. Let's put all of this together to see how this, in the end, gives you a technique to predict API calls for a given natural language query. All of this starts with some data set of projects, in this case more than 400,000 Java projects. This is given to a static analysis that, as we've just seen, extracts two things: by looking at the Javadoc annotations of methods, it extracts these sequences of words from the first sentence of the Javadoc, and it also extracts the call sequences by running the AST-based analysis that looks at the types of the objects on which methods are called. This gives us a set of pairs, and this set of pairs is used to train a sequence-to-sequence model, where, as we've already seen earlier in this module, we have an encoder RNN, which will summarize the given input sequence into a context vector, which is then given to a decoder RNN, which will then produce an output sequence. In this concrete application, the input sequence consists of the words that we see in the annotation of the Java method, and the output is a sequence of API calls. So during training, we basically just take the data that we get from the static analysis: the annotation serves as the input sequence, because the annotation contains the words that describe in natural language what the API usage is about, and the sequence of API calls serves as the output sequence that we expect the decoder to predict.
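To make the shape of this training data concrete, here is a minimal sketch, with names of my own choosing, of what one such training pair looks like, using the copy example from earlier:

```java
import java.util.List;

public class TrainingDataExample {
    // One training example: the encoder RNN consumes the annotation tokens,
    // and the decoder RNN is trained to emit the API call sequence.
    record TrainingPair(List<String> annotationTokens, List<String> apiCalls) {}

    public static void main(String[] args) {
        TrainingPair pair = new TrainingPair(
            List.of("copy", "bytes", "from", "a", "large", "input",
                    "stream", "to", "an", "output", "stream"),
            List.of("InputStream.read", "OutputStream.write"));
        System.out.println(pair);
    }
}
```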
Once this model has been trained, we can use it to actually help a developer. During this API prediction phase, we do not use the data extracted by the static analysis; instead, we have a developer who provides some query in natural language to the model, and the model then predicts some sequence of API calls that hopefully helps the developer understand what API methods to call and in what order to call them. To give you a feeling for how well this model works in practice, let's look at a few examples that are mentioned in the paper. One example uses the query "generate md5 hash code", which basically says: hey, I want to compute this MD5 hash, I don't know how to do it, please tell me. What you get as the output sequence are these three calls: MessageDigest.getInstance, MessageDigest.update, and MessageDigest.digest, which happen to be three calls that you typically would invoke in order to generate an MD5 hash. A second example is one where we're just asking, hey, I want to convert an int into a string, and what you get as the response is just a single API call, namely Integer.toString, which does exactly that. And a third example is this one, where the query is "get files in folder". So apparently the developer wants to get all the files in a folder, and one way of doing this would be to first create a File object, then call list, and then, for every name in the resulting list of file names, create a new File and check whether this file is a directory, in order to only get the files and not the subdirectories in this folder. What you can see from these examples is that this approach can work pretty well. Of course, there will be cases where it does not work, but overall it's a pretty nice idea to use this natural language information and to combine it with API usage sequences. All right, so now you've seen one concrete application of using sequence-to-sequence models for analyzing software, where, given a natural language query, the task is to predict an API usage that does what the query is describing. Thank you very much for listening, and see you in the third part, where we look at another application of sequence-to-sequence models.