Hi and welcome to this talk about TypeWriter, an approach for neural type prediction with search-based validation. My name is Michael, and this is joint work with Georgios, Jason, and Satish. While doing this work, we've all been at Facebook, in my case for a sabbatical.

The motivation for this work is that dynamically typed languages have become very popular, but they lack type annotations. In code written in, for example, Python or JavaScript, developers by default do not add type annotations, which leads to several problems. The most obvious one is that type errors are not detected statically, but perhaps only at some point once your program is running. Type annotations are also useful for understanding APIs, so without them, APIs are harder to understand. And finally, the lack of type annotations leads to rather poor IDE support, because IDEs can make better suggestions if they know the types of functions, variables, and so on.

Fortunately, there's the idea of gradual typing, which means that you can add some type annotations to your program, and type checking is then done only on the annotated parts; a tiny code sketch of this follows below. But of course, someone still needs to add these type annotations to an existing code base, and it turns out that programmers do not really like to spend a lot of time annotating types.
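To make gradual typing concrete, here is a minimal sketch in Python; the two functions are made up for this illustration, and I'm assuming a gradual type checker such as mypy with its default settings.

```python
def greet(name: str) -> str:   # annotated: the checker verifies calls and body
    return "Hello, " + name

def shout(text):               # unannotated: treated as dynamic, not checked
    return text.upper() + "!"

greet(42)   # flagged statically: int is not compatible with str
shout(42)   # not flagged, even though it fails at runtime
```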
So how can we automatically add types to an existing code base? There are essentially three options.

Option number one is static type inference, i.e., a static analysis that typically works conservatively: whenever it suggests a type to you, that type is guaranteed to be correct. In practice, though, these tools turn out to be rather limited, because dynamic languages are inherently hard to analyze.

Option number two is dynamic type inference. Here you run your program, observe types at runtime, and then add those types to the source code. The problem is that, depending on the inputs, you may see some types but not all possible types, and you may also miss many types if not all code is covered.

Option number three, which is what we use in this talk, is probabilistic type prediction, where some kind of model learns, for example, from the type annotations that already exist in part of the code base, and then predicts further types that you can add to your program. One popular way of doing probabilistic type prediction are neural models that predict types: deep learning based models that look at the source code, and at specific pieces of information in it, to suggest type annotations. For example, these models look at identifiers, because identifiers often give great hints about the types of functions and variables. They may also look at comments, because this natural language information is very helpful for guessing which type to annotate. And of course, they also look at the code itself, for example in the form of code tokens. A couple of such models have been proposed over the past few years, and we compare against them in this work.

These probabilistic type prediction approaches have a lot of advantages, but they also bring a couple of challenges; two important ones are the following. The first is imprecision. Because these models are probabilistic, they may make wrong predictions, and in practice they actually do make wrong predictions. So someone, for example a developer, must decide which of the suggestions made by these models to actually follow. The second problem is a combinatorial explosion of options for which types to actually annotate. Essentially, for each missing type in your code there will be one or more suggestions, and if you want to explore all combinations of these suggestions to decide which of them to use, you face a combinatorial explosion; at least for large programs, exploring all these combinations is simply not feasible.

Let me illustrate this problem with an example, a piece of Python code. We have two functions. One is called findMatch, and it takes an argument called color that is compared against the list of colors returned by the helper function getColors. If a matching color is found, that color is returned, and otherwise the function returns None. Now, if you want to add type annotations to this code, a neural type prediction model may make the following predictions. It may suggest that the argument color is an int; that if this is not correct, then maybe it's a string; and that maybe it's a boolean. Likewise, the model makes predictions for the return type of findMatch and for the return type of getColors. The naive approach would be to just take the topmost predictions, but if you add those three type annotations to the program, you actually get type errors, for example because the argument color is annotated as an int but is returned from a function whose return type is supposed to be a string, which simply doesn't go together. It turns out that the correct type annotations are also contained in these lists of predictions, so the challenge is to actually find them among the suggestions made by the model. The code sketch below reconstructs this example.
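The following is a minimal reconstruction of the running example from the talk; the function bodies and the exact candidate lists are assumptions made for illustration, chosen to be consistent with the walkthrough later in the talk.

```python
from typing import List, Optional  # imports like these also hint at likely types

def getColors():
    # Candidate return types: 1) List[str]  2) List[int]  3) str
    return ["red", "green", "blue"]

def findMatch(color):
    """Return the given color (a string) if it is supported, else None.

    Candidate types for the argument color: 1) int  2) str  3) bool
    Candidate return types for findMatch:   1) str  2) Optional[str]  3) bool
    """
    for c in getColors():
        if c == color:
            return c
    return None
```

With the naive top-1 choice, annotating color as int conflicts with returning it where a string is expected; this is exactly the kind of inconsistency that has to be resolved.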
Finding a consistent set of annotations among these candidates is exactly what our approach, called TypeWriter, does. TypeWriter takes a program as input and returns the program with type annotations; it essentially takes your code and writes types into it, just like a typewriter. It does so in two steps. The first step is a probabilistic type prediction model, in our case a neural network, which combines information extracted through a lightweight static analysis and feeds it into a prediction model that returns a list of type predictions for every code location where a type annotation is still missing. Given these lists of type predictions, we then use a static type checker and a feedback-directed search to find consistent types among the predictions. As you've seen, not all predictions lead to a type-correct program, so the challenge is to find those that do, and to annotate the program only with type annotations that together yield a type-correct program at the end.

Let me now go through the different components of TypeWriter, starting with the first one, which extracts information about the source code; in our case, both natural language information and programming language information. On the natural language side, it extracts the names of functions, the names of arguments, and function-level comments, because all these pieces of information are useful for predicting the types of functions. On the programming language side, it extracts occurrences of the to-be-typed code element along with the code surrounding these occurrences, and it also extracts the types made available through imports, because those give a hint about the types the program might actually want to use.

Let's illustrate this with our example again, the same Python code as before. One piece of natural language information TypeWriter extracts are the identifiers associated with the to-be-typed program element; for this function, that is the name of the function and the names of its parameters. Another piece of natural language information are function-level comments, which in this case give a great hint about at least one of the types, because the comment tells us explicitly that the color argument should be of type string. In addition to this very useful natural language information, TypeWriter also extracts programming language information. In particular, it looks at every occurrence of the code element we want to type; for example, it looks at all occurrences of the parameter color and, for each occurrence, extracts the sequence of tokens around it. There is one occurrence in the function header and another one in the comparison, and for each of those, TypeWriter extracts the surrounding tokens. In addition, it looks at imports: if at the top of our piece of code there are some imports, it extracts which types are imported there, because this often indicates the types a developer would actually like to use in an annotation. The sketch below illustrates what this extraction step might produce for the color parameter.
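To make the extraction step concrete, here is a sketch of the kind of features it could produce for the color parameter of findMatch; the field names and the exact representation are assumptions made for illustration, not TypeWriter's actual data format.

```python
features = {
    # Natural language information:
    "identifiers": ["find", "match", "color"],   # sub-words of function and argument names
    "comment": "return the given color (a string) if it is supported",
    # Programming language information:
    "token_contexts": [                          # tokens around each occurrence of `color`
        ["def", "findMatch", "(", "color", ")", ":"],
        ["if", "c", "==", "color", ":"],
    ],
    "available_types": ["List", "Optional", "str", "int", "bool"],  # visible via imports
}
```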
After extracting all this information, it is given to our neural type prediction model. The inputs to this model are the code tokens, the identifiers, the comments, and the available types associated with a particular code location, and the output is a type vector, which can be thought of as a probability distribution over a large set of available types, telling us how likely the different types are for this specific code location.

For the code tokens, we use a pre-trained token embedding to get a vector representation of every code token, and then feed the resulting sequence of vectors into a recurrent neural network, an RNN, which summarizes everything into one vector used in the final part of the model. We take a similar approach for the identifiers and the comments, except that we now use a word embedding, i.e., a pre-trained embedding for natural language words. Again, the identifiers and also the comments are each summarized into a single vector using an RNN. Together with the vector of available types, all of this is given to a hidden layer, which at the end feeds everything through a softmax function, so that we get a probability distribution over the set of available types. The sketch below shows one way such a model could be wired up.
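As a rough sketch of this architecture in PyTorch; the layer types, sizes, and other details here are illustrative assumptions, and the model in the paper differs in its specifics.

```python
import torch
import torch.nn as nn

class TypePredictionModel(nn.Module):
    """Sketch of a TypeWriter-style type prediction model."""

    def __init__(self, code_vocab_size, word_vocab_size, num_types,
                 embed_dim=128, hidden_dim=256):
        super().__init__()
        # Embeddings for code tokens and natural-language words; TypeWriter
        # uses pre-trained embeddings, here they are randomly initialized.
        self.code_embed = nn.Embedding(code_vocab_size, embed_dim)
        self.word_embed = nn.Embedding(word_vocab_size, embed_dim)
        # One RNN per input kind; each summarizes a sequence into one vector.
        self.token_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.ident_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.comment_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Hidden layer over the concatenated summaries and the available-types vector.
        self.hidden = nn.Linear(3 * hidden_dim + num_types, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_types)

    def forward(self, tokens, idents, comment, available_types):
        # tokens, idents, comment are batches of index sequences;
        # available_types is a multi-hot vector over the type vocabulary.
        _, t = self.token_rnn(self.code_embed(tokens))
        _, i = self.ident_rnn(self.word_embed(idents))
        _, c = self.comment_rnn(self.word_embed(comment))
        x = torch.cat([t[-1], i[-1], c[-1], available_types], dim=1)
        # Softmax turns the scores into a probability distribution over types.
        return torch.softmax(self.out(torch.relu(self.hidden(x))), dim=1)
```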
What this model gives us is a list of predictions for every code location where a type is missing, and the second part of the approach then searches through these predictions for a set of consistent type annotations that we can add to the program while keeping it type-correct.

Concretely, we get the top-k predictions for each missing type and filter them using a gradual type checker; we can use one of the many existing gradual type checkers, for example pyre or mypy for Python, or flow for JavaScript. The challenge is that this turns out to be a combinatorial search problem: if we have a set S of type slots and k predictions for each of these slots, then the number of possible type assignments, i.e., ways to assign types to these slots, is k^|S|, which is exponential in the number of type slots. In practice, because programs are large and have many missing types, this space is far too large to explore exhaustively.

Instead of exploring this space exhaustively, we look at different variants of the program P, where each variant uses a different type assignment, i.e., adds a different set of types. We start with the program as it is given to us, which may already have some type annotations or none at all. There are then different ways of adding types, and you can think of this as a tree that we explore, where taking an edge either adds, removes, or replaces some types; we traverse this tree to explore the space without trying all possible annotations. The question is which of these variants to explore first, i.e., how to navigate this search space efficiently.

The answer is a feedback function that guides the exploration of the search space. The feedback function looks at two things: on the one hand, we want to minimize the number of missing types in the program, i.e., add as many types as possible; on the other hand, we want to minimize the number of type errors we introduce, because at the end we do not want any type errors at all, so that we are guaranteed to get a type-correct program. To do this, the feedback function combines the number of still-missing types and the number of type errors into a feedback score, which is basically a weighted sum. By default, we give type errors a higher weight, so that we do not favor adding incorrect type annotations; in case of doubt, it is better to not add a type annotation than to introduce a type error. Given this feedback function, TypeWriter explores the search space trying to minimize the feedback score, i.e., trying to add as many type annotations as possible without introducing type errors.

Our search space exploration works in an optimistic way, which essentially means that TypeWriter first adds the topmost predicted type everywhere, and then removes or refines some of these types to reduce type errors. TypeWriter can be configured to use a greedy or a non-greedy exploration strategy: greedy basically means that whenever the score tells us things get better, we simply go further down the tree of explored program variants, whereas in the non-greedy case the search sometimes backtracks to avoid getting stuck in local minima.

Let's look back at the example we've seen before. To remind you, the probabilistic type prediction model gives us some predictions for each missing type annotation, and if we just take the topmost predictions, we get two type errors, so the score from the feedback function tells us that this is not good yet. What the search does then is either remove one of these type annotations or go down the list of predictions. For example, say it decides to use not the first but the second prediction for the color argument, i.e., string instead of int: this actually reduces the type errors by one, so we know this variant is better than the one we have seen before. The search then keeps exploring, because there is still a type error left; for example, it may use the second instead of the first prediction for the return type of findMatch, which yields a set of type annotations that is type-correct. At this point the search stops, because we have filled in all missing type annotations and, at the same time, obtained a program that is type-correct. The sketch after this walkthrough shows what the feedback score and a greedy exploration could look like in code.
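Here is a minimal sketch of the feedback score and a greedy exploration over type assignments. Representing a program variant as a slot-to-annotation mapping, and the check callback that stands in for running an external gradual type checker such as pyre on the annotated program, are simplifying assumptions of this sketch; the real search also supports a non-greedy, backtracking strategy.

```python
from typing import Callable, Dict, List, Optional

W_ERRORS, W_MISSING = 2.0, 1.0  # type errors weigh more than missing annotations

def feedback(num_errors: int, num_missing: int) -> float:
    """The feedback score: a weighted sum, lower is better."""
    return W_ERRORS * num_errors + W_MISSING * num_missing

def greedy_search(
    predictions: Dict[str, List[str]],                 # slot -> ranked candidate types
    check: Callable[[Dict[str, Optional[str]]], int],  # returns number of type errors
    top_k: int = 3,
) -> Dict[str, Optional[str]]:
    # Optimistic start: assign the topmost prediction to every type slot.
    assignment: Dict[str, Optional[str]] = {
        slot: preds[0] for slot, preds in predictions.items()
    }

    def score(a: Dict[str, Optional[str]]) -> float:
        missing = sum(1 for t in a.values() if t is None)
        return feedback(check(a), missing)

    improved = True
    while improved:
        improved = False
        for slot, preds in predictions.items():
            # Try the next candidates for this slot, or dropping the annotation.
            for candidate in preds[1:top_k] + [None]:
                variant = dict(assignment, **{slot: candidate})
                if score(variant) < score(assignment):
                    assignment, improved = variant, True  # greedy: keep any improvement
                    break
    return assignment
```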
Let's now have a look at how we evaluated TypeWriter. We applied the approach to two code corpora: one is all the Python code at Facebook, which consists of many millions of lines of code, and the other is an open-source corpus of almost six million lines of code. In this code there are millions of places where one would like to add type annotations; we focus here on the argument types and return types of functions. Of those possible type annotations, between 6 and 12 percent are already annotated, and we use these to train the neural model that TypeWriter is based on.

Let's start by looking at the effectiveness of the neural model alone, i.e., just the first part of TypeWriter. We measured the precision, the recall, and the F1 score of the predictions that we get from the neural model, first for the top-1 prediction, i.e., looking only at the topmost prediction the model makes. We can of course also look at more predictions, for example the top-3 or the top-5, and the numbers then obviously get better, because within, say, the top-5 predictions you have a higher chance of finding the correct type. Looking at the top-5 predictions, TypeWriter is able to predict 75 percent of all missing types with a precision of 85 percent, so overall it's pretty effective.

We also compared TypeWriter's neural model to some baselines. One of them is an existing neural model, namely NL2Type, which turned out to be the best competitor. We find that TypeWriter is a bit better: looking at the F1 score, it is five percent higher. Another baseline we considered is a purely frequency-based approach, which always predicts the most frequent type first, the second most frequent type as its second suggestion, and so on; admittedly a pretty naive approach, but an interesting baseline nevertheless. This turns out to be significantly less effective than the neural model in TypeWriter.

Let's now look at the effectiveness of the overall approach, including both the neural model and the search. We do this by looking at different variants of the approach: the two search strategies, greedy and non-greedy, and, for each of those, different configurations that consider the top-1, top-3, or top-5 predictions made by the model. We then measure the number of correct type annotations, where you can consider two definitions of correct. One is the number of type-correct annotations, i.e., all annotations that do not lead to a type error; the other is the number of annotations that match a known ground truth. The two are not always the same, simply because there may be annotations that differ from the known ground truth but are nevertheless type-correct.

The results show that, depending on the strategy and the configuration, TypeWriter adds up to 75 percent of all missing annotations in a way that keeps the code type-correct, and 65 percent of them also match what the developers themselves annotated at some point; so many of the missing types can indeed be successfully predicted. We also compared TypeWriter to an existing static type inference approach, namely pyre infer, and we find that it can predict some of the types, too, but the percentage is much, much lower, and almost all of its types are subsumed by the types predicted by TypeWriter; in this comparison, TypeWriter is strictly better than the static type inference.

While overall we're pretty happy with these results, there are of course some limitations that future work could address. One is that type correctness does not imply that a type is the intended one: if a type is correct according to the type system, that does not mean the developer actually wants that type, because sometimes there are multiple type-correct options and a developer may prefer only one of them. Another limitation is that TypeWriter is based on a fixed type vocabulary, which means there are some rare types it simply cannot predict; it would be interesting to address that limitation. Finally, TypeWriter relies on an existing gradual type checker, which turns out to be not particularly fast, so a tighter integration with type checking could lead to an overall faster approach.

In conclusion, I've presented TypeWriter, the first neural type prediction approach that uses search-based validation: it combines a neural type prediction model with a gradual type checker to make sure that every type annotation added to the code still yields a type-correct program. TypeWriter is being used at Facebook right now and has already annotated thousands of otherwise missing types. Thank you very much for listening. If you have any questions, feel free to ask them in the Q&A session at FSE, or if you're listening to this later, feel free to email me or one of the other authors, and we'll be happy to answer any questions. Thank you very much.