Hello everybody, I'm Alberto Massitta, and today we are going to talk about teaching machines to read code changes and to tell what happened: predicting a commit message with neural networks. Just a brief word to thank my company, who sent me here, provided all the hardware for the training and the experiments, and provided the time for me to run them. We are a consulting company completely devoted to open source, which is pretty cool, and we only use open source technology. We are an Italian company, and we do DevOps, cloud, ML, big data, and lots of interesting stuff.

So why do we want a commit message suggester, and why this presentation? We want a commit message suggester because we want to help the developer: we want to give a hint to the developer, or we want to catch bad commit messages. Just imagine if your Jenkins pipeline were able to reject pull requests based on the lousy commit messages attached to them. I think we have all been there, right? So it would be nice to have an automated way to generate messages, and to use it as a similarity metric for how good a commit message is.

What we don't want is a message based on templates, because messages based on templates just suck: they can only cover very narrow cases, and we don't want a nicely crafted template where you just fill in names. Instead we want to really generate a message that pertains to the very specific case you're tackling. We also don't want a message that merely summarizes what has changed, because what has changed is trivial: this file changed in this way, and that is the point of the git diff itself. We instead want a message that captures the high-level, or at least the medium-level, intent of the coder: why it changed, okay?
So it turns out that generating a message for a code change is a summarization problem, and we want to generalize: what was the intent of the coder, at least at a low level, why did they do that? Maybe we are not exposed to the high-level goals of the project, because you cannot even tell them by looking at the commit history, but we want to know what is changing in that particular context. A change of code always comes with a commit message which describes the full change, so in essence, when the developer writes a commit message, they are generating a summary of the changes, and we want to exploit exactly this. The patch provides a really focused source of code-to-summary mappings. In this example we change only this line and replace it with this one, and this one is being copied, because I want to generalize to multiple items.

So the point of the learning in this case is learning a code-to-summary mapping. What can we use for that? Neural networks. Neural networks are particularly good at generalizing a mapping from source to target, and machine translation can actually help a lot, because the whole point of statistical machine translation, and nowadays neural machine translation, is to infer a mapping between languages, either by counting co-occurrences of words between two parallel corpora, or by vector-embedding manipulation. What I mean is that we take a sentence and map it into a dense vector of float numbers, much like we do with TF-IDF in information retrieval, and we put it into a high-dimensional space, say 256 dimensions. Concepts that are closely related will be near each other in this space, and this holds across different languages, so we can map from one language to another by looking at concepts whose vectors sit near each other. In order to do machine translation we need an architecture and a data set. The architecture that we picked in this case is
the Google neural machine translation architecture, which is the state of the art for machine translation. Basically, this is a sequence-to-sequence model with attention. The inputs are fed into a bidirectional recurrent neural network with LSTM cells, in order to prevent the vanishing gradient problem, which is the phenomenon by which a neural network tends to forget what it has seen at a state that is far back in the computation, and so is not able to learn. Then we feed these two paths, forward and backward, into a stack of eight layers which also have residual connections. A residual connection is when you not only feed the output of the previous layer into a new one, but you also feed in the original input and concatenate them together. This is because it was observed that producing the intermediate representation of a result along with the input helps generalize better: when you are back-propagating, the gradients flow again across the residual connections and are basically not hampered.

Then you feed everything into an attention model, which is a nice machine that sits in between and helps the decoder understand which part of the input should be used, across all the words you have been feeding in, to generate the current state. The decoder is the network on the right, the RNN which actually generates the sentences. So the encoder gets the context of the diff patch, everything that changed, the file names that changed, the code that changed, and feeds it into the decoder, which, by means of a recurrent neural network, generates the actual words. The decoder is also an eight-layer stack of recurrent neural networks with residual connections, but it is unidirectional, not bidirectional.

Then we needed a data set, so we used the commit data set provided by the Jiang and McMillan experiment, which this presentation takes inspiration from, and which was
a piece of research done two years ago about how to extract commit messages from diff patches. They used a totally different technology, because they didn't have this at the time, so we are actually using better weapons to attack the same problem. It's two million commits from the top 1,000 Java repositories on GitHub. We extracted only the first sentence from each commit message, then we basically stripped away the issue ids and the commit hashes, and we tokenized everything on whitespace, keeping the camel casing and the punctuation, because they are part of the language. Then we had to cut out everything that was longer than 100 words, whether in the diff patch or in the commit message, and this left us with about 75,000 commits. Then we applied another filter, which was described in that experiment by Jiang and McMillan: only keeping verb/direct-object messages. If I take a bad commit message, like blobs of code followed by "fix", and there are lots of them, I cannot feed it in as a training sample, because it will make my model produce garbage. The best commit messages are those which begin with a verb: "Added this file", "Updated the changelog",
"Removed this kind of import", and then there is a direct object after that, so that it really is a good message that summarizes the change. It was a very heuristic way to tell bad commit messages from good ones, a kind of unsupervised way to separate out the bad messages, which should never be fed in as training samples. And so we were left with about 30,000 commits, split into three different sets: the biggest for training, one for validation during training, and one for testing after training was done, to prevent peeking.

We used Sockeye, which is a deep learning framework for sequence-to-sequence models, much like Tensor2Tensor, based on Apache MXNet, which is a really cool framework by the way, even though it's a bit of an underdog in the field: everybody is just in love with PyTorch and TensorFlow, but I like MXNet because it's really polished and really well designed from the ground up. The training happened on AWS itself, over a Tesla K80 and a Tesla V100, and I want to spend a few words on the fact that the new GPU that Nvidia released is a total beast. It costs four times as much, but it's four times faster. They say that the only thing money can't buy is time; well, in deep learning this buys you a lot of time. The original experiment had a 38-hour training run; I ended up at five hours, so it really let me iterate a lot. And you can see from the perplexity of the output generated through the training that the model picked up almost instantaneously, and there was no overfitting.
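The first-sentence extraction and verb/direct-object filter described above can be sketched in a few lines. This is a minimal sketch only: the talk doesn't detail the mechanics of the original filter, so the fixed verb list and the function names here are illustrative stand-ins, not the real heuristic.

```python
import re

# Illustrative only: a tiny stand-in vocabulary of leading verbs.
LEADING_VERBS = {"add", "added", "fix", "fixed", "remove", "removed",
                 "update", "updated", "upgrade", "upgraded"}

def first_sentence(message):
    """Keep only the first sentence of a commit message."""
    return re.split(r"[.\n]", message.strip())[0].strip()

def is_good_training_sample(message, diff, max_tokens=100):
    """Heuristic filter: verb-first message with a direct object, and both
    the message and the diff short enough to fit the model."""
    tokens = first_sentence(message).split()      # whitespace tokenization
    if not tokens or tokens[0].lower() not in LEADING_VERBS:
        return False                              # doesn't start with a verb
    if len(tokens) < 2:
        return False                              # verb but no direct object
    if len(tokens) > max_tokens or len(diff.split()) > max_tokens:
        return False                              # too long to train on
    return True
```

With this sketch, a message like "Removed unused import" passes, while a message that does not lead with a verb is rejected as a training sample.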
This is the same graph on a logarithmic scale, so you can see the model still had a long way to go in this apparently flat region here; it was really still picking up, and it stopped by itself after entering that plateau. So the results, five hours later, which is 242 epochs and 43,000 mini-batches, are these, and these are actual figures that I ripped out yesterday evening while I was frantically scrapping through my presentation to tweak it.

So, I removed these two lines, and this is the original human commit message, on the validation set, so no peeking here: "Remove not needed import". The machine translation said "Remove unused import", so the machine really generated that message just by looking at the diff. In this case the messages are actually equal: "Add table of contents in Python readme", and the machine said exactly the same. This makes me think there is a kind of repetition pattern in the training data; the data was shuffled and divided, so it should be fine, but I really could not believe it, because I thought the machine would have tweaked the wording a little. This one is nice: here we are updating the version of Gradle, so "Update gradle", and the machine said "Update build tools version". Great: it knows that Gradle is a build tool, and it understood that I was updating the version.

This is my favorite: a version bump in the Maven POM file, with the human message "Update the os-maven-plugin to fix an issue with IntelliJ IDEA on Windows". There was no way the machine could actually know that there was a problem on Windows, so the machine said "Upgrade the os-maven-plugin to fix the build issue", because if you are changing the POM file, it means you had a build issue.

And this is the actual plot of the neurons in the attention model. Okay, let me just go through this: here you can see the patch and the commit message, everything crammed into just one line, okay?
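Schematically, each cell of this grayscale plot is an attention weight: the score of one encoder position against one decoder step, normalized so the weights sum to one. Here is a minimal sketch of dot-product attention; it is a simplification (the GNMT model uses a learned attention mechanism), and all names and the toy vectors are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax: raw scores become weights summing to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(decoder_state, encoder_states):
    """Score every encoder state against the current decoder state with a
    dot product, then normalize; high weights are the cells that 'fire'."""
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    return softmax(scores)

# Toy 2-dimensional states: the first encoder position aligns with the
# decoder state, so it should receive the largest weight.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

The first encoder position gets the highest weight, which is exactly the bright cell you would see in a plot like this one.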
And you can see here the message that was output: "Upgrade os-maven-plugin to fix the build issue". It's really crammed because this is a fixed-size representation graph, so the labels jump over one another, but you can see which neurons got activated. Whenever you see blank space, you are seeing neurons that didn't fire up. It's basically a matrix multiplication: you multiply your input by a matrix and you obtain another matrix as output, and the neurons with the highest scores are the ones that activated, standard neural network stuff, while the neurons with a low score didn't fire up, because there was no match between the input and the output when the input vector ran over the matrix. So here we are plotting the real float numbers on a grayscale: the higher the number, the higher the correlation between this input and this output. You can see that "Upgrade os-maven" actually triggered when it saw the artifactId words and when it saw "maven", and "to fix the build issue" triggered a lot when it saw "maven". So "maven" is a synonym for "build issue": if you're changing your POM file, it means you have a serious problem building. That's what the machine actually learned. The graph goes this way because you start from the first words, which are at the top right, and the gradients fall like this.

So, was everything actually a success? For those who are not acquainted with machine translation lingo, the metric we use in machine translation is called BLEU, which stands for bilingual evaluation understudy. The point is that we compare the output of our machine translation with the human translation, and we count the n-grams, the runs of words, in common between the two. It is scaled between zero and one hundred, and it's actually impossible to get
one hundred, even for a human, because it would mean you had written exactly the same thing. So this metric is not very good by its own standard, but it tells you something: if you have a very low BLEU score, say 10 or 15, your model sucks; fluency starts popping in at 18-20, so for example 26 is a nice BLEU score; and the state of the art, which is Google Translate on a fluent language pair, is 34. The original experiment by Jiang and McMillan got 33; we obtained 37, because we were armed with a better architecture, and a character n-gram F-score of 40, which is stellar: precision and recall around 40 percent, which is really, really high. Watch out: if you get a too-high BLEU score, like 70, it means that something is totally wrong, because you're just copying, you're feeding the validation input as a training sequence, and you cannot just learn that. So the model has learned fluent English, it outputs perfect English phrases, and it spotted very interesting correlations in short commits and patches.

Okay, I talked briefly about a story of success; why would I talk about failure? Well, because then I said: okay, we are using a very constrained data set, let's try to remove a lot of constraints and see how it performs on wild commits. And actually I didn't get very good results, because the error rate for long patches is embarrassing: a lot of sentences, although good English, are totally incoherent with their inputs, and that's why the data set was so carefully chosen. I have a lot of these. For example, the human commit message, and I won't show you the patch because it doesn't matter, was "Change the fold FBO cache size to zero"; the machine said to add a new import for something called "no pass". There was no "no pass" anywhere, there were no imports.
There was nothing related to that. So why? This is where things get interesting, because you learn the most about a technology when you watch it fail: you look at stack traces and you learn how the dependencies are doing, you see an edge case and you know your model better. This turns out to be an extremely difficult task in practice, because a vanilla machine translation architecture is not tuned for this particular task. For example, when you're translating English to French, the sentences are approximately the same length. Actually, translating English to French is so easy that it's basically not even taken into account anymore, it's too easy a task; instead the field shifted towards English to German, where the formation process of sentences is very different, because German concatenates things, compared to English. But even there you have a growth ratio that doesn't put the sentences too far out of balance: 30 words against 40, maybe, or 20, because of the concatenation. Instead, here, with deep patches and a pitiful little commit message, you have an imbalance which is tenfold.

This leads us to the fact that the decoder is fluent, because the output is always within about ten tokens, ten words on average, so a recurrent neural network which unfolds ten times is able to generalize well and to output fluent English. But then you have very poor context performance. If you output a very fluent English sentence which is totally incoherent with the patch, it means that the encoder network, this one, remember, is not able to condition the decoder appropriately. You are not able to capture any meaningful context to feed the decoder, in order to instruct it about what it should be generating, so it generates random fluent words.
So, totally gibberish: it looks like very nice English, but you're not conditioning it. Not even a human can remember a 500-word context, and this is not about the vanishing gradient problem, because the LSTM is able to prevent that; at length 500 it simply cannot carry enough context, and the attention model, which is supposed to link the exact input word to the exact output word, cannot keep up with such lengthy inputs. And this is our fault, actually, because the diff patch is too complex: it contains insertions, deletions, context, and so we are just cramming too much stuff together. This approach, with the current architecture, is just doomed; it cannot take us much further. Also, there are memory problems, because if you are unfolding a recurrent network 500 times, your memory will explode. So Google neural machine translation works, but the Transformer, which is the state of the art in sequence-to-sequence models, goes out of memory instantaneously, as soon as the training begins.

So I'm proposing here a better architecture. The main source of chaos stems from the input length and complexity, because we are cramming together insertions, the new lines of code, in green, deletions, the red lines we are removing, and the context, the white lines.
That just tells you where you are. It would make much more sense to adopt a multi-encoder network, in which we use one encoder for insertions, one for deletions, and one for context, and then a hierarchical attention network to tell which encoder should influence the output and which should be disregarded, and then a decoder to funnel everything out. This is much in the spirit of the Transformer, actually, because the Transformer is a sequence-to-sequence network which doesn't use any recurrent neural networks, and just uses attention machinery with multiple attention heads, eight of them actually, looking, attending, at different parts of the code. So you have different moving parts attending to different parts of the input and the output, and that gives it a lot of nice performance.

Remember that here you have a very natural way to separate the context, because you know this is a deletion and this is an insertion, and everything which doesn't have a minus or a plus is just context. So what I propose is that we take this very same patch, and we stash the lines pertaining to context, no matter where they are, into a separate encoder, and the same holds for deletions and insertions. We get an attention model on each, just to tell which are the relevant parts, and then we feed all three attentions through a global attention, and that goes into the decoder. The advantage is that the input complexity is factored into sub-parts. The speed is unimpacted, because you have the same number of matrix multiplications, plus three.
I mean, one for each time you unfold: if you have a 500-long input, you do at least 500 matrix multiplications; in reality there are many more, because you matmul for the input, you matmul each time you compute a new step, and you matmul for the output. So, since you are factorizing the 500 into, say, one hundred, one hundred, one hundred, 300 in total, you have the same number of multiplications, plus only three more, and the precision is expected to improve a lot.

In a traditional attention you take a source state, you do a dot-product attention against all of the states, then you score them between zero and one to say which state is most pertinent to the current state, and then you sum together all these weighted states, dominated by the most pertinent one, and you feed that to the decoder. I'm proposing this instead: you take the hidden state at time zero, you feed it into the deletion attention and the insertion attention, for example, and you compute the weights against the deletions. That is: how does the current state, the probability of generating the next English word, correlate with the lines that were actually deleted? You do the same for the lines that were inserted, and then you feed the two attentions into a global attention, whose inputs are not even states, but the summed attention outputs, and in that global attention you decide which is the next word you should be generating, based on the fact that you are seeing this deletion context and this insertion context. So no more cramming everything into one single flat space, no matter whether it's a deletion or an insertion: you factor it out by hierarchically distributing the complexity. This is still to be coded, of course, because it's a proposal, and it's a thing that I would actually like to realize; it's not that technically difficult, actually.
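The first step of the proposed architecture, splitting a patch into the three streams the three encoders would consume, can be sketched over plain unified-diff lines. A minimal sketch, assuming standard unified-diff markers; the function name is illustrative.

```python
def split_patch(diff_lines):
    """Split unified-diff lines into the three streams a multi-encoder
    network would consume separately."""
    insertions, deletions, context = [], [], []
    for line in diff_lines:
        if line.startswith(("+++", "---", "@@")):
            continue                       # file headers and hunk markers
        elif line.startswith("+"):
            insertions.append(line[1:])    # green lines: new code
        elif line.startswith("-"):
            deletions.append(line[1:])     # red lines: removed code
        else:
            context.append(line)           # white lines: unchanged surroundings
    return insertions, deletions, context
```

Each of the three lists would then be fed to its own encoder, with the hierarchical attention deciding which stream conditions the decoder at each step.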
It's just a novel architecture. And this is basically the end of the presentation. You can find everything in my GitHub repository; everything is on the slides, which you can download from the website. Thanks everybody for your attention. If you don't ask any questions, I will keep telling jokes... no questions? Yeah, please.

Okay, so the question was whether it would be feasible to use generative adversarial networks, which are the kind of neural network architecture in which you have two networks playing a game against each other: one tries to generate a nice sentence, and the other one has to detect whether it was generated by a human or by a machine. So you have a generator and a discriminator which tries not to get fooled, and during training they compete to get better. Yes, this is perfectly doable, because GANs can actually be used for natural language processing. It would be a novel approach which would make total sense to try. They have a very different training scheme, because they use reinforcement learning instead of vanilla back-propagation, but yes, it would be a totally nice thing; please do, and come here next year to tell us how it went. Thanks. Another question?

So the question is: is there any publicly available study which tells whether there is a programming language that is handled best? There was a multi-language corpus in a study which was actually about generating comments, and they gathered together different corpora of code; there was Python, there was JavaScript, and actually the performance was not too different. In that case what makes the difference is having a very rich, high-quality corpus.
So for example, I expect C code to be of high quality, because of the commit messages in the Linux kernel repository: if you write a bad commit message there, Linus Torvalds is going to bang on your door in person and kick you very roughly, so there is a kind of reign of terror which ensures that people stay in line and don't commit rubbish into the repository. So I just expect that if the project is serious, you will have nice training data and you will be able to generalize better. Next question?

Can you please talk louder? It's very faint from here. Yeah, sure. Okay, so the question is: have you tried verifying whether the commit messages produced by the human and those produced by the machine actually match, the other way around? Okay, the question is really tricky. If I looked at the commit messages written by humans and how they correlate with the actual code changes, I could use this as a way to try to score the messages written by humans. But there's a problem: if I use this kind of metric, which counts the n-grams in common, the contiguous runs of characters in common between the sentences, then these two are very different, so they will get a low score, but actually they are both pretty good. So it's really difficult to establish a metric to evaluate whether the original message correlated well with the code change. The same goes for this one, "Update gradle".
This will get a very lousy BLEU score too, because the capitalization is different, so the n-grams will not match. There was another message in which there was a typo in the original commit message, "Update change log" without the final letters, and the machine output "Update changelog" spelled correctly: it was actually able to generalize the correct spelling, and that will also get a low score. So it's really difficult to score the quality of the message, because the original message is often low quality, and that's the whole point of coming up with this kind of Rube Goldberg contraption. Yeah, question?

Okay, so the question was: since you are actually learning to translate between things that you've seen, how can you cope with things that you haven't seen? Well, that's the problem of out-of-vocabulary words, and neural machine translation suffers a lot from these, because it's not able to generate a float vector for something it hasn't seen. There is an effort around factorizing words into subwords, so that you can cope by means of n-grams. But generally it does a good job, because we all use frameworks, so there's a lot of stuff in common between different projects, and this tells us something about the nature of software development: most of the code is not ours but comes from some framework, and there are naming conventions which help a little. Otherwise, yes, you get a rubbish thing which it is not able to compute, but most of the time it goes well, and the fairly high BLEU score is very encouraging.
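The subword factorization mentioned in this last answer can be illustrated with a toy greedy segmenter. This is a much simpler stand-in for real subword methods such as byte-pair encoding, which learn the subword vocabulary from corpus statistics; the vocabulary here is made up for the example.

```python
def segment(word, vocab):
    """Greedy longest-match subword segmentation: an unseen identifier is
    broken into pieces the model has seen, so it still gets a vector."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):   # longest candidate first
            piece = word[start:end]
            if piece in vocab or end - start == 1:
                # unknown single characters pass through as-is
                pieces.append(piece)
                start = end
                break
    return pieces

# Toy subword vocabulary: a never-seen identifier still decomposes
# into known pieces.
vocab = {"get", "user", "name", "Id"}
```

For instance, an out-of-vocabulary identifier like `getusername` decomposes into the known pieces `get`, `user`, `name`, which is how subword models cope with code they have never seen.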