Hi and welcome back to Analyzing Software Using Deep Learning. We are in the module on token vocabularies and source code embeddings, and this is the second part, in which we will take a more detailed look at one way of obtaining token embeddings, namely by pre-training them. In the previous part we've seen what the token vocabulary problem is, which is the problem of having too many tokens to reason about, and here we will see one very popular approach for addressing this problem. The previous part showed how we can obtain a set of tokens that we consider to be our vocabulary, and now we want to look into the question of how to represent a given token as a vector.

Why do we care about this question? Very simply, because a neural model, that is, a deep learning model, requires vectors as inputs. It cannot reason about tokens given as, say, strings; we somehow need to turn the strings into vectors. So essentially what we need is a mapping that takes a token out of our vocabulary V and maps it to a real-valued vector of length k, so just a vector of k real values.

There is one very simple, maybe a bit naive, way of getting such a representation, and that is the so-called one-hot encoding. We are given the set V of tokens, and we give each token t in V a unique index. You can basically think of looking at the set V as a list and taking each token's position in this list as the index of the token. Then we create a vector that has the length of our vocabulary size, where all elements are zeros except for one element, the one at the index of t, which is set to one. So essentially we have a mapping e that takes every token t to a vector of length k, where k is the size of the vocabulary, and element i of this vector is one if i happens to be the index that represents the token t, and zero otherwise.

Let's have a look at a simple example to illustrate this idea of one-hot encoding. Assume that our vocabulary V consists of the keyword if, the open parenthesis, the closed parenthesis, and ID, which is just an abstraction of all the identifiers we may have in our code. Now we create the mapping e that takes a token and maps it to a vector, in this case a vector of length four, because that's the size of V, and specifically we look at the position, or index, of each of these tokens in a list representation of the set V. For the token if, we get a vector that starts with a one, because if happens to be our first token, followed by zeros until we have reached four elements, the length of the vector. Similarly, for the open parenthesis we get all zeros except that the element at index two is one; for the closing parenthesis the one sits at the third position; and for ID everything is again zero except, in this case, the last element, which is one.
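To make this concrete, here is a minimal sketch of one-hot encoding in Python, using the four-token example vocabulary from above (the helper names are my own, not from the lecture):

```python
import numpy as np

vocab = ["if", "(", ")", "ID"]                # our vocabulary V
index = {t: i for i, t in enumerate(vocab)}   # a unique index per token

def one_hot(token):
    """Map a token to a vector of length |V| that is all zeros
    except for a single one at the token's index."""
    vec = np.zeros(len(vocab))
    vec[index[token]] = 1.0
    return vec

print(one_hot("if"))   # [1. 0. 0. 0.]
print(one_hot("ID"))   # [0. 0. 0. 1.]
```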
So this is what is called one-hot encoding. It is a very naive and simple way of encoding a token vocabulary. The big benefit is that it's simple, but the big disadvantage is the size of the vectors we get: if we have a large vocabulary, we get very long vectors, most of which are just filled with zeros, with a single one somewhere. So it's not a very space-efficient way of encoding tokens.

A much better way of encoding tokens, and this is what we'll look at next, are so-called token embeddings. Here the idea is again that we take the tokens and map them into a vector space, but in this case the vector space does not have the same dimensionality as our vocabulary; it is actually much smaller than the size of V. The idea is to map every token to a vector such that tokens that are semantically similar have a similar vector representation. For example, two identifiers that have more or less the same meaning, say index and idx, will have a similar vector representation, or at least this is the goal of these token embeddings.

Let's have a look at a concrete embedding that was learned in some project, to get an idea of what this can look like. In this case you see vectors that are projected into two dimensions. That's a very short vector size; typically the vector size is more like 100 or 200, but you can project it into a lower-dimensional space, and this is exactly what you see here: a projection into 2D. Each of these points in the space corresponds to one token in our vocabulary, and what you see is that tokens that are semantically similar to each other, for example container and wrapper, happen to be close to each other, so the vector space encodes some semantic similarity between these tokens. Here you also see identifiers, and in this case error, also a literal, that are related to messages, alerts, and errors, so these are also closely related, and these three here all happen to correspond to some kind of sequential data structure, a list or a sequence.

Having such an embedding is pretty nice, because it brings some of the semantic information we have about tokens, for example which identifiers mean similar concepts, into the vector space. This enables the model to reason about semantic relations between individual tokens, instead of just seeing them as a sequence of data we don't know anything about.

Now the big question is: how can we get such a vector embedding for our token vocabulary? There are essentially two options, which we'll briefly discuss now, and then we'll look more into one of them. Option one is to learn the embedding function e that maps each token into the vector space jointly with the rest of the model. There is some model that we care about; maybe it predicts bugs, maybe it predicts types, maybe it does something else with code. While training this model, we also learn the embedding function e jointly with the rest of the model. This works in such a way that at first every token is encoded in some more naive way, for example using the one-hot encoding, and then the very first step of the model is a projection of this one-hot encoding into a smaller-dimensional vector, which then serves as the embedding. As the rest of the model is trained, this projection is also learned.
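To illustrate option one, here is a minimal PyTorch sketch; the layer sizes, the pooling step, and the toy classification head are my own assumptions, not from the lecture. The point is that the embedding layer, which is equivalent to multiplying a one-hot vector with a learned projection matrix, is simply the first layer of the downstream model and is updated by the same training loop as everything else:

```python
import torch
import torch.nn as nn

class DownstreamModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # Equivalent to multiplying a one-hot vector with a learned
        # projection matrix: this is the very first step of the model.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        e = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        return self.head(e.mean(dim=1))    # pool over the token sequence

model = DownstreamModel(vocab_size=10_000, embed_dim=128, num_classes=2)
# One optimizer over all parameters updates the embedding jointly
# with the rest of the model.
optimizer = torch.optim.Adam(model.parameters())
```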
What is nice about option one is that the embeddings you get will be a good fit for the downstream application that you care about. For example, if your model is about bug detection, then these embeddings are specifically learned to be well suited for bug detection.

Option two is to not train the embedding jointly with the rest of the model, but to pre-train a separate embedding model ahead of time. This has the big advantage that we can specialize a neural architecture just for learning a good embedding, and we'll actually look into some of these embedding models in the next few minutes. So the big advantage is that we are likely to get a better embedding, simply because the model is really made for this. Another advantage is that we can do this pre-training once, on a huge corpus of code, which may not even be the corpus that we learn on when we train the rest of the model, and then we can reuse this pre-trained embedding over and over again, as long as we stay in the same programming language. For the rest of this part of the module, we'll focus on option two and look into some options for pre-training an embedding model for tokens.

One very popular way of learning embeddings, not only of tokens in code but also of words in natural language, is the word2vec model. This model was originally proposed for natural language, for example for the words that appear in texts, but nowadays it has also been adopted very frequently for reasoning about source code, because you can apply the same idea to a sequence of tokens as to a sequence of words. The basic idea of word2vec is what is called the distributional hypothesis, which is summarized in the one sentence you see here: "You shall know a word by the company it keeps." The context that surrounds a word tells you something about the word itself. If you look at the words among which a particular word or token occurs very often, this tells you something about the meaning of this token, and if two tokens occur in similar contexts, then these two tokens are likely to be similar. This is the key idea that word2vec uses, and we'll see two variants of this idea that we can then use to actually learn a vector embedding of tokens based on the contexts in which these tokens occur.

So let's have a look at the first of the two variants of the word2vec model, the so-called continuous bag-of-words model, often abbreviated as CBOW. The basic idea of this model is that you can predict a token, or a word, from the surrounding tokens or words that occur around it. So basically, if I give you a piece of code with a couple of tokens, then a gap in the middle, and then another couple of tokens, you should be able to guess what the token in the middle is. This guessing is exactly what the model learns to do, because not only a human can do this, but in this case also a model: it tries to predict a token from the given context, meaning the surrounding tokens. What we eventually want to get out of the model is a vector representation of the token, let's call it t_i, and what is given as the input to the model is a sequence of tokens before this token t_i, so let's say t_{i-2} and t_{i-1}.
You could of course also have a larger window of context, and also some tokens that follow afterwards, so let's say t_{i+1} and t_{i+2}. In the most simple version of this model, we feed all of this given information through a hidden layer, and from there we predict the token that is missing in the middle. As usual, these connections are determined by matrices that are learned. Let me note down a few more things: this is the input layer x, then we have the hidden layer in the middle, let's call it h, and then we have the output layer y at the end. The context size can of course be different; what we have here is a context of size k = 4.

Now let's have a look at how the hidden layer h and then the output layer y are computed from the input layer x. One common way of computing h is the following: we take the one-hot vectors x_j of all the given tokens t_j, where j in this case goes from i-2 to i+2, and in general from i - k/2 to i + k/2, but without i itself, because we obviously do not know the i-th token, otherwise the task would be trivial. We sum all these vectors, multiply the sum with our weight matrix U, which is of course learned during training, and normalize by the number of input tokens k:

h = (1/k) * U * (x_{i-2} + x_{i-1} + x_{i+1} + x_{i+2})

Once we have h, we can compute y, and here we use a softmax function to get, for each token in the vocabulary, the probability that it is the token we want to predict. So y is the result of the softmax applied to a second weight matrix V times our hidden vector h:

y = softmax(V * h)

Now, CBOW is one variant of how word2vec can be implemented. There is a second variant that also turns out to be pretty effective and is also used quite often, and this is called skip-gram. In skip-gram we are basically trying to solve the inverse problem of what we've just seen in CBOW, because now we predict the context of a given token from this token. You can think of me giving you a specific token in a program and then asking: what do you think are the tokens that come just before and just after this given token? What is given here as the input is the token t_i, and what we want as the output are, for example, the two tokens before and the two tokens afterwards, so t_{i-2}, t_{i-1}, t_{i+1}, and t_{i+2}. Again, the most simple way of doing this is to have one hidden layer in the middle, where we feed in the given token t_i, represented as a one-hot vector, and from there we predict these surrounding tokens. As usual, each of these steps is controlled by a weight matrix, called U and V. Again, this is the input layer x, this in the middle is our hidden layer h, and what we want to predict here is the output layer y. Now let me also tell you how these layers are computed. Given x, h is simply the multiplication of the weight matrix U, which is going to be learned, with the input layer, so h = U * x. Once we have h, we can predict all the output tokens, again using the softmax function softmax(V * h), which for each output position gives us a probability distribution over the possible set of tokens. At the end, what we get is, for each of these four context positions, a prediction of which token is most likely to occur there in the context of the given token t_i.
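To make these formulas concrete, here is a small numpy sketch of the CBOW forward pass. The matrix sizes and the random initialization are illustrative assumptions, and a real implementation would of course also include the training step that updates U and V:

```python
import numpy as np

V_size, d = 1000, 64                     # vocabulary size, embedding size
U = np.random.randn(d, V_size) * 0.01    # input weight matrix (learned)
Vw = np.random.randn(V_size, d) * 0.01   # output weight matrix (learned)

def one_hot(i):
    x = np.zeros(V_size)
    x[i] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())              # numerically stable softmax
    return e / e.sum()

def cbow_forward(context_ids):
    """h = (1/k) * U * sum_j x_j, then y = softmax(V * h)."""
    k = len(context_ids)
    h = U @ sum(one_hot(i) for i in context_ids) / k
    y = softmax(Vw @ h)                  # probability of each candidate t_i
    return h, y

# Context token indices for t_{i-2}, t_{i-1}, t_{i+1}, t_{i+2}:
h, y = cbow_forward([3, 7, 42, 99])
```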
Alright, so now you've seen these two variants, CBOW and skip-gram, of the word2vec model. What they basically do is predict either the token or the context from the respective other piece of information, but what you actually care about in the end is how to get an embedding. So let's now have a look at how to obtain an embedding once we have these CBOW or skip-gram models.

In both cases the idea is the same: we first train a model for the pseudo-task we've just seen. Either we have the CBOW-like model that takes the context and predicts the missing token, or we have it the other way around, the skip-gram model that takes the token and tries to predict some context. In both cases we first train the model to be good at this, and once the model has become good at actually predicting the token or the context, we look at the hidden layer in the middle. This hidden layer essentially is a summary of the given token; it is either a summary based on the context or a summary based on the token itself, but in both cases we can use this hidden layer as a vector representation of the token t_i in the middle. So essentially, we wait until the network has become good at this pseudo-task, where waiting doesn't mean just sitting around, but actually training the model, and once the model is good at the task, we use this hidden layer, which essentially is a vector, the one I've already marked in blue up there, as the vector representation, or the embedding, of our token t_i.

Now, this word2vec model works pretty well, and it has been used widely in a couple of analyzing-software-with-deep-learning applications, but it has one big problem, and that is the out-of-vocabulary problem. Essentially this means the following. We have some set of tokens that we consider to be our vocabulary and that we use while training the embedding model. Once we have trained it, at some point we want to make predictions on a different set of code, and what may happen is that this code contains tokens that we have never seen during training. Because the word2vec embedding doesn't know anything about these tokens, all we can do is represent each of them as a special "unknown" token, which basically means we throw away all the information contained in these tokens, and this may be very valuable information. For example, this may happen if we want to do the prediction on a program that comes from a different domain than the code we trained on; in this new domain there will be identifier names that never appeared in our training data, but we throw away all this information because we do not really know anything about those tokens.
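Continuing the numpy sketch from above: for a single one-hot input, the hidden layer h = U * x reduces to one column of U, so after training, U itself can be read off as the embedding table. The UNK fallback below is a common convention, an assumption on my part rather than something prescribed by the lecture:

```python
def embed(token_id):
    """e(t_i): the learned d-dimensional embedding of token i,
    i.e., the hidden layer that token i produces on its own."""
    return U[:, token_id]

# Out-of-vocabulary tokens have no column in U; a common fallback
# (an assumption here) is a reserved UNK index they all map to.
UNK_ID = 0
def embed_or_unk(token_id):
    return U[:, token_id] if 0 <= token_id < U.shape[1] else U[:, UNK_ID]
```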
One nice idea for addressing this out-of-vocabulary problem is to not only learn embeddings of entire tokens, but to also look at so-called subtokens, basically substrings of a given token. The way this addresses the problem is that we now learn an embedding for each subtoken, and once we see a new token that was not in our training data, maybe we can compose this new token from subtokens that we have already seen during training. As a concrete example, consider the name setHeight, which might be some method name in a piece of Java code. We could decompose it into two subtokens, set and height, and once we see another identifier during prediction, for example modifyHeight, then maybe we already have information about the subtoken height, and maybe from some other token seen during training we also know something about modify. From these two embeddings of modify and height, we can put them together and get some representation of modifyHeight.

One question related to this subtoken idea is how to actually construct the subtokens. In the previous example I just used the typical conventions for splitting identifier names, and that's a very good first approximation, but there are also more general approaches for decomposing a token into subtokens so that we can then learn embeddings for each of them.

One of them is the FastText approach, a follow-up work on word2vec that significantly improves it by decomposing tokens into so-called character n-grams. A character n-gram is simply a sequence of n characters that appear consecutively in a given token. What FastText does is learn an embedding for each such n-gram in a given token, using a word2vec-like skip-gram model. Once these embeddings for the individual n-grams have been learned, it uses all the n-grams that appear in a given token to compute the embedding of this token: given some token t, it looks at all the subtokens, that is, the n-grams, in t, computes the embedding of each such subtoken s, and then sums up all these embeddings to get one embedding for the entire token t. The nice thing about this is that even if our training data did not include exactly the token we need during prediction, we have probably seen some of the n-grams in this token, and then we can reuse this information to get an embedding for the new token.

Let's have a look at a concrete example to illustrate this idea of FastText a little more. Say our given token is getHeight, and assume that the n for the n-grams is three, so we care here about 3-grams, that is, three consecutive characters at a time. Let's look at all the 3-grams in this token: the first three characters, get, would be one of them; then the next three characters, e, t, and H, as another one; then t, H, and e; and so on, until finally g, h, and t as the last one. As you can see, some of them make sense, get is also a word on its own, while some of the others do not really mean anything, but FastText just blindly extracts all the n-grams, 3-grams in this case, from the given token. For each of them we get an embedding, and all of these embeddings are summed up, which in the end gives us the embedding of the token t, where t is the getHeight token we are given.

So FastText is one way an embedding can be learned on particular subtokens of a given token, and as you've seen, these subtokens are extracted in a pretty generic way, by just looking at n-grams.
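Here is a minimal sketch of the FastText idea in numpy. The on-the-fly random initialization of unseen n-grams is a simplification on my part; in the real FastText the n-gram embeddings are learned with a skip-gram objective, and word boundaries are additionally marked with special characters:

```python
import numpy as np

d = 64                   # embedding size (illustrative)
ngram_embeddings = {}    # n-gram -> vector; learned in real FastText

def ngrams(token, n=3):
    """All character n-grams of the token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def embed_token(token, n=3):
    """Sum the embeddings of all n-grams in the token."""
    vec = np.zeros(d)
    for g in ngrams(token, n):
        if g not in ngram_embeddings:             # toy fallback: random init
            ngram_embeddings[g] = np.random.randn(d) * 0.01
        vec += ngram_embeddings[g]
    return vec

print(ngrams("getHeight"))
# ['get', 'etH', 'tHe', 'Hei', 'eig', 'igh', 'ght']
```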
An alternative approach, which actually uses the given data, that is, the given corpus of code or in this case the given vocabulary, to compute these subtokens, is an algorithm called byte pair encoding. You can think of it as a compression algorithm, but as a side effect it also gives us a list of subtokens to consider for a given token. Let me try to explain how byte pair encoding works; a small code sketch follows at the very end of this part. At the very beginning, we have just one subtoken for each character in our vocabulary. So if our vocabulary consists only of tokens formed from, say, lower-case characters, then a to z would be the subtokens we start with. Then there is a big loop that repeatedly creates new subtokens by finding, among the subtokens we already have, the pair that occurs together most frequently across our entire vocabulary. For example, starting from the individual characters, we may see that g and e appear next to each other very often, because there are many tokens that contain get and other tokens that contain a g followed by an e. Say g, e is the most common pair of subtokens that appear consecutively; then the algorithm joins g and e into a new subtoken by merging the two together. This is done repeatedly until we have some given number of subtokens in our set of subtokens, and this number is configurable, so you can basically configure how long the whole byte pair encoding algorithm will run. What it gives you as a side effect is an ordered list L of merge operations: each merge operation found in this loop is added to the list L.

Once we have this list and we are given a token t that we would like to represent, we split t into characters and then merge these characters, and later on larger subtokens, using all the operations in L, in the same order in which we put these operations into L. So we first merge individual characters, and then merge larger subtokens into each other, always yielding a new subtoken that is already in our set of subtokens. This way we find a decomposition of the given token t into subtokens such that the most common subtokens that appear in our data are used, instead of just using all n-grams as in FastText.

Alright, so now you've also seen this third way of handling the vocabulary problem for tokens, where we compute an embedding that maps every given token into a vector. All of these approaches have in common that the vector we get has a constant size, even when the code corpus grows, and the reason is that we can specify this vector size; for example, in word2vec we can specify the size of the hidden layer, and this will then be the size of the learned embedding. And this is already it for this second of three parts on code vocabularies and token embeddings. I hope you have learned something. Thank you very much for listening, and see you next time.
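As promised above, here is a compact sketch of the byte pair encoding merge loop. The toy vocabulary with frequencies, the function names, and the fixed number of merges are illustrative assumptions; production BPE implementations differ in many details:

```python
from collections import Counter

def learn_bpe(token_freqs, num_merges):
    """Learn an ordered list L of merge operations from a vocabulary.

    token_freqs maps each token to how often it occurs in the corpus.
    """
    merges = []
    # Start with one subtoken per character.
    vocab = {tuple(tok): freq for tok, freq in token_freqs.items()}
    for _ in range(num_merges):
        # Count how often each pair of adjacent subtokens occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)                 # record the merge operation in L
        # Apply the merge everywhere in the vocabulary.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Toy vocabulary (an assumption for illustration):
merges = learn_bpe({"getWidth": 4, "getHeight": 3, "setHeight": 2}, num_merges=5)
print(merges)  # e.g. [('e', 't'), ('g', 'et'), ...]
```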