Hi, welcome to this presentation. I'm going to present Scaffle, which helps you localize where to fix a bug that crashes your software, in a code base that contains millions of files written in multiple languages. My name is Michael, and this is joint work with Vijay, Rebecca, Mateusz, Erik, and Satish, who all work at Facebook. I'm at the University of Stuttgart, but I've done most of this work during a sabbatical at Facebook.

The motivation for this work is that deployed software sometimes crashes. If you have software that is used by many people, you may experience thousands of field crashes every day. So the question is: given a large code base, how do we find out where to fix the bugs that cause these crashes? This question is relevant for a couple of reasons. First, it helps you find the right team, or maybe even the right developer, to fix this kind of bug. Second, it gives this developer a starting point for fixing the bug. And finally, localizing the bug may also enable automated program repair techniques, many of which assume that you know where the bug is.

Our goal in this work is crash-based, file-level bug localization. Crash-based means that the only input we consider is a so-called raw crash trace, that is, a file that contains evidence about the crash, such as stack traces and other kinds of information. File-level bug localization means that we want to predict the file, or the set of files, in which to fix the bug. Ideally, we would like even more fine-grained information, for example, pinpointing the specific statement to change to fix the bug, but given the kind of code base we are interested in here, this is very challenging. So we focus on file-level bug localization.

To make things more concrete, let's look at two examples of these raw crash traces. What you see here is a crash trace produced by an Android application. It's pretty long: it contains several dozen, sometimes hundreds, of lines of text, and it contains different kinds of information. For example, somewhere in this crash trace there is a stack trace, or maybe even multiple stack traces, because one exception may trigger another. But there is also a lot of other information, for example, about the application that crashed or the system on which the crash happened.

Here's another example of a crash trace, this time from a server-side piece of code written in PHP. Again, there are different kinds of information: a stack trace, and some information about the application that crashed. But you can also see that the format differs a lot from the previous crash trace. These crash traces come in various formats, depending on the language, the platform, and many other factors, and the formats also evolve over time.

To localize a bug given such a crash trace, we need to address a couple of interesting challenges. One of them is scalability: we are dealing with a code base that may contain millions of files. It's too large to analyze all of these files with a static analysis, or even to do a pairwise comparison between a crash trace and each of these files.
The code base is not only large, but also very heterogeneous: it's written in different languages, it covers code running on different platforms, and it covers various application domains. Finally, we need to deal with the challenge that the raw crash traces we are given are a bit fuzzy, because the information in a crash trace may not exactly match the code base. For example, a file path mentioned in a crash trace may not match exactly what you see in the code base, because paths on the device differ from paths in the repository.

Of course, we are not the first to look into the problem of bug localization, and two interesting streams of prior work have inspired ours. One of them is based on traces of correct and buggy executions, which is great if you have these traces, but it does not work in our case because we do not have tests that reproduce the field crashes. The other stream also considers some evidence of a bug, for example a stack trace or a bug report, as an input similar to our crash traces, but we found that all of these existing approaches have some kind of scalability problem, for example because they assume that you can statically analyze all the code in your code base. Finally, practically all existing work focuses on a single programming language, whereas we would like an approach that works across different languages.

Let me now introduce Scaffle, our approach for localizing bugs. Given a crash trace, the goal of Scaffle is to predict which file, out of the many files in the code base, to change in order to prevent this crash in the future. The key insight of Scaffle is to decompose this problem into two easier sub-problems: the first takes the crash trace and tries to identify the relevant lines in it, and the second takes these most relevant lines and matches them against the files in the code base. We address the first sub-problem with our so-called trace line model, a machine learning model that predicts which lines are most relevant, and the second with an information retrieval-based search.

Let's start with a more detailed look at the first of these two components, the trace line model. Given a crash trace, which essentially is a sequence of lines, we want to predict what we call the relevance vector: a vector of numbers indicating how relevant each line of the crash trace is. We implement the trace line model as a machine learning model that learns from data how to predict the relevance of each line. The reason we use machine learning is that crash traces come in many different formats, and these formats evolve over time. So instead of hard-coding a set of heuristics to make sense of these crash traces, we learn them from data.

To train a supervised machine learning model, we need training data to learn from, specifically pairs of a trace and its relevance vector. We get this training data from past crashes and the information about how these past crashes were addressed: we have a crash trace, and we know which files were changed to address the crash. The question we want to encode in the relevance vector is: which lines of the given crash trace are most relevant for locating the fixed files?
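To give a rough idea of how such labels can be derived, here is a minimal Python sketch of the overlap-based labeling described next. The tokenization and the normalization by the number of fix tokens are my own simplifications; Scaffle's actual implementation may differ:

```python
import re

def tokenize(text):
    # Split on non-alphanumeric characters, then on camelCase boundaries.
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", text):
        tokens += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part)
    return {t.lower() for t in tokens}

def relevance_vector(trace_lines, fixed_file_paths):
    # Tokens from the path segments of all files changed by the fix.
    fix_tokens = set()
    for path in fixed_file_paths:
        fix_tokens |= tokenize(path)
    # A line is more relevant the more of the fix tokens it mentions.
    return [len(tokenize(line) & fix_tokens) / max(len(fix_tokens), 1)
            for line in trace_lines]
```

With this kind of labeling, lines that mention the changed file's name or path segments get a high relevance score, while metadata lines end up near zero.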
To compute these relevance vectors, we tokenize the crash trace lines into individual tokens and split the paths of the files that were changed into path segments; we then compute the overlap between these two sets of tokens and segments. The more overlap a line has, the more relevant we assume it is. Looking back at one of the two examples I introduced earlier: let's assume this crash was fixed by changing the file DoStuffBrowserController.java. Then lines 27, 28, and 29 would be considered most relevant, because they contain many tokens that overlap with the path of the file that was touched to fix this bug.

Now that you know what training data we use, let's look at how the model actually works. It's a neural model, based on deep learning, that takes the crash trace as input and considers it line by line, splitting each line into tokens. Each line is encoded using a word2vec embedding that we pre-trained on the traces, which gives us a sequence of embedded tokens for each line. Each line is then fed through a line-level recurrent neural network (RNN), one per line in the crash trace, which gives us a sequence of line vectors. These line vectors are then summarized by another recurrent neural network, the trace-level RNN, which condenses the entire trace into a trace vector. Finally, this trace vector is fed into a fully connected layer, which predicts the relevance vector of the crash trace.

Let's look back at the big picture. We've now seen how the first component works, which takes the crash trace and identifies the most relevant lines. Let's now look at the second component, the information retrieval-based search that takes the most relevant lines and matches them against the files in the code base. We formulate the problem of taking a line of a given crash trace and matching it against all the paths in the code base as an information retrieval task. Usually, information retrieval deals with queries that are used to find documents in a set of documents. Here, the query is a line of the crash trace, and the documents are the paths of all the files in the code base. To do this search, we tokenize all the words that appear in the given line; for example, we tokenize one of the lines from the example you saw earlier into these words. We also tokenize all the paths in the code base into path segments; given this path, for example, we would split it at the slashes and at the dot. The information retrieval-based search then matches the tokenized paths in the code base against the words in the line and tells us which of the file paths is closest to the given line.

We evaluate Scaffle on a set of about 20,000 crashes that occurred in various Facebook products running on Android, on iOS, and on the server side, where it's mostly PHP code. As the code base, we consider the monorepository that Facebook uses to host almost all of its code: a huge repository that contains millions of code files written in many different languages. The data was gathered over a four-year period, and to make the setup realistic, we split this four-year period into 50-day steps.
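Concretely, this sliding evaluation scheme works roughly as sketched below (a minimal Python illustration; the variable names and the exact window handling are mine, not from the paper):

```python
from datetime import timedelta

def rolling_splits(crashes, start, end, step_days=50):
    """crashes: list of (timestamp, trace, fixed_files) tuples."""
    step = timedelta(days=step_days)
    t = start + step
    while t + step <= end:
        # Train on everything observed so far, test on the next 50 days.
        train = [c for c in crashes if c[0] < t]
        test = [c for c in crashes if t <= c[0] < t + step]
        yield t, train, test
        t += step
```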
That is, at a given point in time, we take all the crashes that have occurred up to this point as training data to train our model, and then evaluate it on the crashes that occur in the next 50 days.

The main result I want to discuss in this talk is the end-to-end effectiveness of Scaffle: how often it successfully predicts the right file, or at least one of the right files, that need to be changed to fix a crash. What you see here is the effectiveness among the top five predictions, given the raw crash traces. On the horizontal axis, you see the different dates, in the 50-day steps we consider; on the vertical axis, you see the percentage of correctly predicted locations for the crashes that occur in each 50-day window. The line you see here is the effectiveness of Scaffle. It varies a bit depending on the 50-day window, but overall it is around 60%, plus or minus a few percent. So overall, Scaffle works pretty well and can predict many of the bug locations effectively.

We compared Scaffle to a couple of interesting baselines. One of them is a heuristic logic that is currently in use at Facebook: a manually written set of heuristics that also looks at these raw crash traces and tries to identify the files. As you can see, Scaffle outperforms this heuristic logic. And it's not only better in terms of effectiveness; it also has the advantage that it's not manually written but automatically learned, so it can evolve easily when the crash traces, the code base, or the languages used evolve. We also consider another baseline, an end-to-end information retrieval-based search, which essentially takes the second component of Scaffle and feeds the entire crash trace into the information retrieval system. Here, we try to match the entire crash trace against the paths in the code base. As you can see, this doesn't work very well, which shows the importance of decomposing the problem into two sub-problems, where we first identify the relevant lines and then match only these relevant lines against the file paths in the code base.

In another experiment, we did not look at the full raw crash traces but extracted only the stack traces, because this is what much prior work focuses on. As you can see, Scaffle again works pretty well: even if you only consider the stack traces, it reaches roughly 60% accuracy in predicting one of the right locations. We compared Scaffle to a baseline suggested in prior work, which assumes that the most relevant lines of a stack trace are at the top; we call this baseline "first lines first". As you can see, it does not work that well, because the first lines are not always the most important ones. We again also consider the end-to-end information retrieval-based search, and again it does not work very well. So once more, it's important to decompose the problem and first identify the relevant lines in the given crash trace.
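All of these comparisons use the same top-five hit metric: the fraction of crashes for which at least one of the files actually changed by the fix appears among the top five predictions. As a minimal sketch (my own formulation of the metric, not code from the paper):

```python
def top_k_hit_rate(ranked_predictions, fixed_files, k=5):
    # ranked_predictions: one ranked list of candidate files per crash.
    # fixed_files: the set of files actually changed to fix each crash.
    hits = sum(1 for ranked, fixed in zip(ranked_predictions, fixed_files)
               if set(ranked[:k]) & fixed)
    return hits / len(ranked_predictions)
```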
One interesting question to ask is, of course: why does this approach work, and why does it sometimes not work? We have identified three reasons that largely determine whether Scaffle succeeds. The first is whether the buggy file is actually mentioned anywhere in the trace. Because Scaffle takes only the crash trace as input and assumes that some line tells us something about the file that needs to be changed, it works if such a line exists, but it cannot work if the file is not mentioned at all.

The second reason is how well Scaffle can handle partial information about these files. Sometimes the full file path is not mentioned in the crash trace, but only the name of the file, or the path has changed a little. When Scaffle can bridge this partial information, it typically works; when it cannot, it fails.

Finally, what also matters is how well Scaffle understands the structure of these crash traces. We found that it has a pretty good understanding of the format of the traces. For example, it figures out where the stack trace is, it figures out which of multiple stack traces to use if there are several, and it learns which stack trace elements are relevant and which are not. For instance, it learns that for an Android crash, the bug is usually not in the Android framework but in code that is actually part of the code base we are talking about here.

As usual, there are many more results in the paper, so if you're interested, please have a look. In particular, the paper has more details on the dataset, and it evaluates the effectiveness of the trace line model on its own; in the talk, I've only focused on the end-to-end effectiveness of the entire approach. We also discuss efficiency, essentially showing that training the model has moderate computational requirements and that querying it is very fast, and we provide a more detailed comparison with different existing baselines.

This brings me to the conclusion of this talk. I've presented Scaffle, our new approach for bug localization on millions of files. One of its key ideas is to decompose this problem into two easier sub-problems: the first is to identify the most relevant lines in the given crash trace, and the second is to match these likely relevant lines against paths in the code base. Scaffle is learning-based and language-independent, which makes it easy to apply to other code bases and to adapt to an evolving code base. For example, if a new programming language becomes more popular, or if the format of the crash traces changes, you do not have to change the approach itself; you only have to retrain its model, and it will continue to work.

That's all I have. Thank you very much for your attention.