Hi, welcome everybody to Analyzing Software Using Deep Learning. What we'll do in this module of the course is to look at one of the cross-cutting or underlying problems of this whole field of Analyzing Software Using Deep Learning, and that's the problem of how to actually represent the source code of programs. In particular, in this module we will look into how to represent tokens, so the basic building blocks of source code. And then in a later module we'll see how to use these representations to find representations for larger snippets of code. So here's an overview of what we'll cover in this module of the course. At first I will explain the token vocabulary problem, so the problem that we're trying to address here, which is essentially about the many tokens that exist in source code. Then we look into a couple of approaches for addressing this problem. One of them, which we'll spend some more time on, is to use pre-trained token embeddings. That's an approach that has been very popular in natural language processing, but has also proven to be pretty effective in modeling source code. And then in the third part of this module we look into learning embeddings that cover both natural language information and programming language information, which is pretty useful to not just reason about source code, but also natural language artifacts associated with source code, such as comments or queries that someone might type into a search engine in order to find some source code snippets. As usual, the content we're covering here is based on a couple of recent papers. There are more than three relevant papers, so I'm just highlighting three here on the slides that are recommended reading for this module, because a lot of the content that you see here is covered in these papers. And as usual, of course, these papers provide many more details. All right, so let's start by looking at what we're actually trying to model here.
So one way of looking at source code, and that's the way we will mostly focus on here in this module of the course, is to consider source code as a sequence of tokens. We can think of tokens as the basic building blocks of code, because basically every piece of code, no matter in what programming language it is written, is composed of tokens. So in order to reason about source code and larger code snippets and maybe entire programs, the first thing you always need to do is to reason about individual tokens and have some way to represent these individual tokens in a way that is suitable for neural networks or for deep learning. To illustrate this problem a little bit, let's look at this piece of JavaScript code that we are seeing here. This basically consists of a sequence of tokens. First of all, there is this comment, which you can think of as just one token, or which you could actually split into multiple tokens if you look at the words in this comment. Either way is fine, but no matter which way you take, you'll get some tokens out of this. And then we have some more tokens here. For example, this identifier followed by the dot, which is also a token. And then yet another identifier followed by the open parenthesis, which is yet another token. Then we have this literal value, so this constant 100 is called a literal, and then a comma and many, many more tokens. One more kind of token that I'd like to highlight here is this function keyword, because that's another kind of token that frequently appears in source code. So talking about the kinds of tokens that you can find in source code, there are essentially two categories of tokens. One is all the tokens that are fixed by the programming language. So there are some tokens that just inherently belong to the language in which the code is written, and these are the operators, things like parentheses and curly braces and so on, and of course, the keywords of the language.
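To make the idea of "code as a sequence of tokens" concrete, here is a toy sketch in Python. This is not a real JavaScript lexer, and the code snippet it tokenizes is a made-up example in the spirit of the one on the slide; it just shows how a comment, identifiers, a literal, and punctuation come out as individual tokens.

```python
import re

# A toy lexer sketch (not a full JavaScript tokenizer): split a snippet
# into comment, identifier/keyword, number, and punctuation tokens.
TOKEN_RE = re.compile(r"""
    //[^\n]*          # line comment, kept as a single token here
  | [A-Za-z_$][\w$]*  # identifier or keyword
  | \d+               # numeric literal
  | \S                # any other single character (operators, punctuation)
""", re.VERBOSE)

code = "// wait a bit\nwindow.setTimeout(100, function() {});"
tokens = TOKEN_RE.findall(code)
print(tokens)
# ['// wait a bit', 'window', '.', 'setTimeout', '(', '100', ',',
#  'function', '(', ')', '{', '}', ')', ';']
```

Note how both categories show up: `function`, the parentheses, and the braces are fixed by the language, while `window`, `setTimeout`, and `100` are chosen by the developer.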
Typically, this is a relatively small set, depending a bit on the language, but usually it's a couple of dozen, maybe 100 or 200 tokens, and usually not more than that. In addition, you have a second category, which typically provides many, many more tokens, and those are the tokens that are chosen by the software developer. In particular, these are of two kinds, namely identifiers, so variable names, function names, names of properties and so on, basically everything you can name in the program, and literals, so all these constants like string constants or numbers that also appear in the source code. In this second category, you usually have a lot of different tokens, simply because developers are pretty creative in creating new variable names or coming up with specific constants that matter for their particular program. Now, this huge size, in particular of the second category of tokens that we find in source code, brings us to the problem that we're actually talking about here, and this is the vocabulary problem. In essence, this problem is about the huge number of tokens that you'll get as soon as you look at not just a tiny program, but at a larger program, or, as you typically do in the context of analyzing software using deep learning, at a large code corpus. The problem here is that a very, very large vocabulary is usually difficult to represent and also difficult to reason about. It's difficult to represent because there are just many things you need to represent, so any trivial approach for doing this will end up with very, very large vectors, and those are typically not very efficient if you want to train a model.
And the huge vocabulary is also difficult to reason about, simply because there are so many different kinds of tokens that it may be difficult for the model to actually generalize and understand the more general patterns that you can see in source code, but that may not be apparent if you see so many different tokens. This vocabulary problem is relevant for two kinds of models: models that take code as an input and models that produce code as an output. If you think of a model that takes code as an input, let's say a model that reasons about a piece of code and tries to determine whether the code is buggy or not, then somehow this model needs to take all this code in. So if you have a lot of different tokens, then the problems that we just talked about, representation and reasoning, show up on the input side. On the output side, we also have a problem, because if the model tries to produce code, for example by providing code completion, then it must predict one out of many, many different tokens each time, and that's an inherently challenging problem, because if you have many options to choose from, it's much more difficult to choose than if you had just a few options. So let's look at some concrete data to illustrate this vocabulary problem a little more. What you see here is a plot taken from this paper, where the authors have looked at 14,000 projects and basically plotted how many different tokens you find if you take either all of these 14,000 projects, so this is the 100% here, or some subset of them, for example just half of them. And what you see on the vertical axis is the number of different tokens that you find when you consider all these projects. And what you can see here is pretty surprising. One thing you can see is that if you consider all these 14,000 projects, then there are almost 12 million different tokens.
So this is not counting the number of tokens you have in total in this code; it's the number of unique tokens. There are 12 million unique tokens in these 14,000 projects. Now, 14,000 projects may sound like a lot to you, but actually this is a reasonable corpus size for realistic learning on software. One thing you could do, of course, is to ignore some of these tokens. For example, you could ignore all comments, because maybe in comments you have a lot of diversity, and let's also ignore all strings that occur in the code, because maybe there are a lot of string constants and they are very specific to the program. But even if you ignore all of those, so this is the gray line that you see here, the number of tokens is still very large. It's still above 9 million, so it's still a lot, and more than you can reasonably model. Another possible approach is to actually split tokens. So instead of considering all these identifier names that developers come up with as they are, let's just split them based on the usual conventions that are used to compose identifier names, which are camel case and snake case. If you do this, as you can see, the total size of the vocabulary reduces quite a bit. So this is something that definitely does help, but what you should also see in this plot is that no matter how you model the vocabulary, whether you omit some of these tokens or split them into sub-tokens, there seems to be a linear growth in the number of tokens in the vocabulary as you add more projects, and this linear growth doesn't really seem to saturate anywhere, even if you consider up to 14,000 different projects. And this is the essence of the vocabulary problem. Now there are different ways to handle this vocabulary problem, and in this course we will talk about three of them, which happen to be the most popular ones. So one way, which we'll cover in some more detail in just a minute, is to abstract the tokens.
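Before we go on, the camel-case and snake-case splitting just mentioned can be sketched in a few lines of Python. The helper name and the example identifiers are my own; this is just one simple way to implement the splitting convention, not the exact method used in the paper behind the plot.

```python
import re

# Split an identifier into sub-tokens along snake_case and camelCase
# boundaries, a common trick to shrink the token vocabulary.
def split_identifier(name):
    subtokens = []
    for part in name.split("_"):          # snake_case boundaries
        # break before inner uppercase letters (camelCase), keeping
        # all-caps runs like "SIZE" together
        subtokens.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part))
    return [t.lower() for t in subtokens]

print(split_identifier("readFileSync"))    # ['read', 'file', 'sync']
print(split_identifier("max_buffer_SIZE")) # ['max', 'buffer', 'size']
```

The payoff is that many distinct identifiers like `readFile`, `fileName`, and `file_size` now share the frequent sub-token `file`, which is why the sub-token line in the plot sits much lower.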
So the idea is to not consider the tokens as they are, but to abstract them in some way so that you get a much smaller vocabulary. The disadvantage is that you're also losing a lot of valuable information, because you're essentially abstracting away the details. For example, if you abstract away the details of identifier names, then you're losing all this implicit knowledge that developers put into meaningful names, and the model just cannot learn from that anymore. Another possible approach is to consider just the N most frequent tokens that occur across your code corpus, which turns out to cover a pretty large fraction of all tokens. So this is great: by just looking at some number N of most frequent tokens, you can cover quite a lot of all the occurrences of tokens in normal source code. But there's still a disadvantage, and that is the out-of-vocabulary problem, which basically means that if you focus on the top N tokens, then there will be some tokens that you do not cover, and if you add more code, there will be more and more of those tokens that you do not cover. And that is not really good, because then you're missing parts of your input or cannot predict some of the outputs that your model should predict. And then there's a third option, which we'll also look at, and this is to embed the tokens into a vector space, such that you get a vector embedding or a vector representation of every token that has constant size even if the code corpus is growing. The main challenge here is that it's non-trivial to obtain an effective embedding, and we will look into some approaches for doing this in a few minutes. So let's have a more detailed look at the first of these three approaches, which was to abstract tokens so that we have a smaller number of tokens overall. Let me start by abstractly showing what is actually meant by this idea. So let's say this is your source code, and each of these little boxes is one of the tokens that you have in the code.
So for now it doesn't really matter what these tokens are, but let's just call them token one, token two, token three and so on. If you have this code, there will be many of these tokens, and some of them may actually be the same. So maybe this is actually not token five, but token two again, because some of the tokens are of course occurring repeatedly. But even if they occur repeatedly, there will still be a very large set of unique tokens. Now what abstraction will do is to abstract these tokens into a smaller set of abstract tokens. So let's say that maybe T one and T two are abstracted into the same kind of abstract token, and then this instance of T two will of course also be abstracted into this abstract token A one, and let's say that A two is an abstraction that will cover both T three and T four. So this is basically the abstraction, and the result will be that our token sequence now looks different, because now we can represent the sequence from above as A one, another occurrence of A one, then an occurrence of A two, followed by another occurrence of A two, followed by yet another occurrence of A one, and so on. So after this abstraction, the code will look different, and, as with any abstraction, we will lose information. But the benefit is that we have a much smaller set of unique tokens, in this example only two instead of four. Now this general idea of abstracting the tokens into some abstract classes of tokens can be instantiated in many ways, and one of them is to do this abstraction based on the kind of token. So let me illustrate what this means through an example. Let's say we have some code that checks if some variable file is unequal to null, and if this is the case, then we are writing into a variable called line what we get if we call file dot read. Now if you look at these tokens, then you can basically assign a kind to each of these tokens. So this could be a keyword.
This could be an operator or a parenthesis. Depending on how exactly you group these kinds, this assignment may look slightly different. This is an identifier, here we have another operator, and here we have, for example, a literal. And now what we could do is to basically rewrite the sequence of concrete tokens into a sequence of abstract tokens by just saying there's a keyword, followed by an operator, followed by an identifier, followed by another operator, followed by a literal, and so on. Okay, so basically every token would just be represented by its kind. Or, what you could also do is keep the programming-language-specific tokens as they are. So instead of doing what I just did, you could keep, for example, the keyword if and keep this open parenthesis, but then abstract away the identifier by just saying there's some identifier. Then again we keep this not-equals operator, because it's part of the language, and we would also keep null, because that's also part of the language, same for the closing parenthesis. But down here we would again abstract things into identifier, and another identifier, dot, yet another identifier, and then keep the parenthesis and also the closing curly brace. Now one big disadvantage of the abstraction that I've just shown is that we're losing a lot of information, and in particular we are losing the information that some of these identifiers actually refer to the same variable. For example, the fact that this occurrence here of an identifier and that occurrence of an identifier both refer to the file is probably interesting, because you typically check if something is not null before you dereference it, and if you just abstract everything into identifier, then we're losing this bit of information. So a different approach that does not have this disadvantage is to do this abstraction by consistently renaming identifiers. So let's look again at the same example that we've just seen.
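The kind-based abstraction just described, in the variant that keeps language-defined tokens, can be sketched as follows. This is a minimal sketch: it assumes the code is already tokenized, and it classifies tokens with hand-picked keyword and punctuation sets rather than a real parser.

```python
# Sketch: abstract a token sequence by token kind. Tokens fixed by the
# language (keywords, operators, punctuation) are kept as-is; tokens
# chosen by the developer collapse into abstract classes.
KEYWORDS = {"if", "function", "return", "null", "true", "false"}
PUNCT = {"(", ")", "{", "}", ";", ".", ",", "!=", "!==", "=", "=="}

def abstract_by_kind(tokens):
    out = []
    for tok in tokens:
        if tok in KEYWORDS or tok in PUNCT:
            out.append(tok)            # fixed by the language: keep it
        elif tok.isdigit() or tok.startswith('"'):
            out.append("LITERAL")      # programmer-chosen constant
        else:
            out.append("IDENTIFIER")   # programmer-chosen name
    return out

tokens = ["if", "(", "file", "!=", "null", ")", "{",
          "line", "=", "file", ".", "read", "(", ")", ";", "}"]
print(abstract_by_kind(tokens))
# ['if', '(', 'IDENTIFIER', '!=', 'null', ')', '{', 'IDENTIFIER', '=',
#  'IDENTIFIER', '.', 'IDENTIFIER', '(', ')', ';', '}']
```

Notice in the output exactly the disadvantage discussed above: both occurrences of `file` become the indistinguishable token `IDENTIFIER`.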
So we would again have this code that checks if file is not equal to null and then reads a line of this file using this read method. Now, if you want to consistently rename this code so that every identifier is replaced by some abstraction of it, but in a way where we can still see that two occurrences of an identifier refer to the same identifier, then what we could do is the following. We would keep all the programming-language-specific tokens, but instead of having file, which is a programmer-chosen token, we will now say this is ID one, or some other abstract name of an identifier. And now for line we would pick a different name, ID two. But here file occurs again, so we again say, hey, this is ID one, and we can see that it's the same name. And then read is yet another name, so this would, for example, become ID three. So essentially we would build a map from the original names that we see in the code to some abstraction of names, but in a way where we consistently rename occurrences of the same identifier everywhere, so that we do not lose that much information. Now of course, if you did this across the entire code corpus, you wouldn't really reduce the size of your vocabulary, because you would basically just have another ID one, two, three and so on for every identifier name. So instead you do this on a much smaller scope. For example, you could do this for every file, or maybe for every method or function that is analyzed. All right, so abstraction is one way to deal with the token vocabulary problem. The second approach that I'll briefly talk about here is to just keep the top N tokens of the code vocabulary. This idea is based on the observation that this vocabulary typically follows a so-called long-tail distribution. So essentially, what long-tail distribution means is that there are some tokens that occur pretty frequently, but it's only a few tokens actually.
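Coming back to the consistent-renaming idea for a moment, it amounts to the map just described: fresh IDs are handed out in order of first occurrence, and the map is reset per file or per function. A minimal sketch, again assuming pre-tokenized input and hand-picked sets of language-defined tokens:

```python
# Sketch: consistently rename programmer-chosen identifiers so repeated
# occurrences map to the same abstract name (ID1, ID2, ...). The mapping
# would be reset per file or per function to keep the vocabulary small.
KEYWORDS = {"if", "function", "return", "null"}
PUNCT = {"(", ")", "{", "}", ";", ".", ",", "!=", "="}

def rename_identifiers(tokens):
    mapping = {}
    out = []
    for tok in tokens:
        if tok in KEYWORDS or tok in PUNCT or tok.isdigit():
            out.append(tok)                    # keep language tokens
        else:
            if tok not in mapping:
                mapping[tok] = f"ID{len(mapping) + 1}"
            out.append(mapping[tok])           # same name -> same ID
    return out

tokens = ["if", "(", "file", "!=", "null", ")", "{",
          "line", "=", "file", ".", "read", "(", ")", ";", "}"]
print(rename_identifiers(tokens))
# ['if', '(', 'ID1', '!=', 'null', ')', '{', 'ID2', '=', 'ID1', '.',
#  'ID3', '(', ')', ';', '}']
```

Unlike the kind-based abstraction, the output still shows that the identifier checked against null (ID1) is the same one being dereferenced afterwards.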
And then there are many, many other tokens that occur infrequently and that give you a long tail of infrequently occurring tokens. Now, based on this observation, if you just keep the N most frequent tokens and abstract or maybe throw away all the others, then we will still cover a large percentage of all occurrences of tokens without having a huge vocabulary. For those tokens that are not among the top N most frequent tokens, what we'll do is to represent them as some special unknown token, which basically says that we don't know what it is, but there is some token here that is not part of our top N. Let's have a look at some concrete data to see how well this top-N idea could work. What you see here is some data taken from a paper that has looked at roughly 100,000 JavaScript files, so a pretty large corpus of source code. In the first column of this table, you see the size of the vocabulary that one could set. So for example, in the first row you see a scenario where we would say, okay, we only focus on the top 1,000 tokens in this whole corpus. Then here you see what percentage of all the unique names this top-N set is covering. So for example, by just looking at the top 1,000 tokens, we would only cover 0.4% of all the unique names in this corpus. But then the nice thing is, and that's what's shown in the third column, that this will cover a relatively large percentage of all the names that occur in the code. So with just 1,000 unique tokens, we will, for example, cover 63% of all occurrences of tokens. And the reason is that there are some tokens that occur pretty often. So there are some names like i and j and maybe x and y, and also some longer names like list or set, that occur pretty often. And this is why a small percentage of all unique names will cover a relatively large percentage of all the names that you find in the code.
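The top-N-plus-unknown-token mechanism just described can be sketched in a few lines of Python. The tiny corpus here is made up purely for illustration; in practice the counts would come from the whole training corpus.

```python
from collections import Counter

# Sketch of the top-N vocabulary: keep the N most frequent tokens from
# a training corpus and map everything else to a special unknown token.
def build_vocab(corpus_tokens, n):
    counts = Counter(corpus_tokens)
    return {tok for tok, _ in counts.most_common(n)}

def encode(tokens, vocab):
    return [tok if tok in vocab else "<UNK>" for tok in tokens]

corpus = ["i", "i", "i", "j", "j", "list", "list", "myRareHelper"]
vocab = build_vocab(corpus, n=3)          # keeps "i", "j", "list"
print(encode(["i", "myRareHelper", "list"], vocab))
# ['i', '<UNK>', 'list']
```

The rare, programmer-specific name falls out of the vocabulary and becomes `<UNK>`, which is exactly the out-of-vocabulary loss mentioned earlier: the frequent names survive, the long tail does not.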
So another line that is highlighted here is to pick a vocabulary size of 60,000, which only covers 24% of all the unique names but happens to cover 90% or 91% of all the occurrences of names. So this covers almost all the tokens that you have in the code, while having a much smaller vocabulary size than you would have if you just naively considered all tokens that ever occur somewhere in this code base. Good, so now we've seen the first two of the approaches that you see here. We've seen how you can abstract tokens and reduce the vocabulary size that way, and we have seen what happens if you just consider the top N most frequent tokens. In the next part of this module, we'll actually look into the third approach, which is to embed tokens into a vector space. So this is all for the first of three parts in this module. Thank you very much for listening, and see you next time.