This idea came to mind way back, when large language models first became a big deal, partly because they turned out to be far more capable than we expected, and there was no easy way to see how they were arriving at the results they produced. So, to set the context: we both have a background in working in software security specifically. And one of the immediate thoughts we had, once foundation models started happening and a lot of open source models started coming out, was: if open source can compete with what is happening commercially, maybe we can build one, maybe we can run one. And just in the time between when we proposed this talk and now, everything has changed. People thought you would not be able to run a model locally, or that you could never run one on a laptop because there isn't enough compute. And it never stops changing.

What we're going to talk about is a specific part of this: large language models for code. When we first started, there were separate models for code; even OpenAI had Codex, and now it's gone, and it seems general large language models just handle code. The list goes on, all the way from CodeBERT, which in some sense defined the area, to AlphaCode, CodeGen, and then Codex, which powers Copilot, the code-generating LLM most of you know, and which had its own drama over the data it was trained on. And just two days ago, in fact this morning, Amazon made CodeWhisperer generally available, so everybody can use it, and it's free for individual developers. So code generation with these models is a hot area. I don't know about your experience, but my experience using ChatGPT for code was similar to how I used Stack Overflow. I remember thinking: great, I just need to ask it and it'll fix it for me; I don't need to figure out how to fix it myself. So that's me: I really liked it. I know people bring up AI taking away jobs, but I was definitely happy: less boilerplate code, less unit testing, less fixing of stuff, better readability, more comments, and so on.

So there's a lot of demand for code models, and then there's the question of open source models. I think in the long run open source models will win, as we have seen with all software. Beyond what open source generally does for software, the principles we're all aware of, when it comes specifically to foundation models in AI there are a few additional advantages that we see. One is data security and privacy: the ability to train models on your own data securely, without having to give it to a third party. That seems like a major win. Alignment and safety seem to be another win, from a decentralization standpoint: if you believe that collective wisdom is what will help us define alignment for AI, then building models in the open will obviously be the better outcome. There are other obvious advantages: as it stands right now, the large commercial models will always be computationally, and as a result economically, expensive, so open source models should win out in the long run there too. And there are other aspects, like training data.
If you are using an open source model, and you do a very simplistic extension of the open source code licensing models out there, you could theoretically train on a lot more data: because you're open sourcing your outputs, you could in principle train on GPL data, or any kind of copyleft data, as long as you are able to reciprocate the license. So it opens up much larger datasets for you.

So we think open source models should win. When we look at coding, though, there is still obviously a gap. That's the long run; in the short run, the open source coding models, or really open source models in general, aren't yet that good. We're getting there, but there is a big difference. Look at code generation, which is left-to-right generation: writing new code, where you ask the model to write some amount of code for a given problem statement. There, the commercial models do a very good job. But as Asankhya and I were talking, one thing we realized is that a lot of code writing is actually not left-to-right generation from scratch. It is editing code: fixing code, maintaining code, extending and modifying code. So look at something called fill-in-the-middle: these are tasks where there is preceding code and following code, and you ask the model to generate the code that fills in the middle. Based on the research that was available, it became very clear that for left-to-right code generation, open source models like CodeGen and SantaCoder were at a very low pass@1 rate. Pass@1, to simplify, multiplied by 100 is just the percentage rate of success on the first attempt when you ask the LLM to do something. So there were very large differences between commercial models and open source models; but for fill-in-the-middle, the difference was a lot smaller. In fact, the open source models were a little bit better. Why is that? There is a lot of speculation about why. One theory is that it needs a smaller model; another is that fill-in-the-middle is primarily about code, so it's not constrained by all the limitations of natural language. Effectively, we saw an opportunity here: we had found a use case where open source models could score a first victory against commercial models, within our understanding of the space.

Another thing we saw, and this goes back to our experience with open source security, is that there was this report from Stanford in late December which talked about how AI-generated code tends to be insecure. And we had seen that ourselves: we have a security startup that works on application security, and we've had the experience that end users and clients really struggle to fix vulnerabilities in their code.
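Backing up to the metric mentioned a moment ago: since pass@k comes up throughout the rest of the talk, here is a minimal sketch of the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021). This is an illustration of the metric, not code from the speakers' project; with one sample per problem, pass@1 reduces to the plain first-attempt success rate described above.

```python
# Minimal sketch: the unbiased pass@k estimator from the Codex paper
# (Chen et al., 2021). Illustrative only; not the speakers' code.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total completions sampled for the problem
    c: how many of those completions were correct
    k: attempt budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 10 samples, 3 correct: pass@1 ~= 0.30, pass@10 == 1.0
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 10))
```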
So those two things combined gave us an interesting idea: since fill-in-the-middle is basically editing code, and there is a concern about insecure code even in generated code, can we find a way to use open source fill-in-the-middle models to fix vulnerabilities? Can we do it in a way that's potentially better than what's out there? That's the thesis of what we tried to do, and to be honest, most of the work is Asankhya's, not mine. This was the idea we were exploring, and Asankhya will now talk about all the exploration we did since then.

Thanks, Rohan. When we looked at this problem back in late December, there were very few models for source code that could actually do fill-in-the-middle, which is what bug fixing needs. One of the models that came out in December last year was SantaCoder, and there are some attractive properties in how it was trained. First of all, it's from BigCode, a large scientific open collaboration backed by Hugging Face and ServiceNow. The key difference, compared to what OpenAI or others did, is that they trained it on a dataset where they respected people's licenses. It's trained on a dataset called The Stack, which contains about 6 terabytes of permissively licensed code: they crawled millions of repos, extracted the license from each, and the eventual dataset, after deduplication, consists only of this set of permissive licenses. So one benefit of using SantaCoder is that the code it produces comes from permissively licensed training data, compared to, say, OpenAI's models. The second thing is they have an opt-out: even as an individual open source developer whose code is in the dataset, you can submit your repository, and the next time they train the model, for the next iteration, your code is removed; they have done this roughly once a month. So if you really don't want your code included, there is a mechanism for it, and I don't know anyone else, including the commercial companies, that provides this. So there are a lot of nice properties to this dataset and the model.

It is also very diverse in terms of programming languages. The checkpoints they released in December are for three languages, Java, JavaScript, and Python, probably the three most common, but the dataset itself contains many more. Now, a lot has changed since December: this is still a GPT-2-style model, and at the time it was released it was competitive with what was available, but today, against GPT-4 or the like, it's no longer that competitive for pure code generation. Still, this is the base model we started with.

Very quickly, I'll show you a little of how it works. Let me go back here; I just wanted to show you the demo they have. The good thing about open source is that, sitting back there, I could just look at what the issue was. My problem is I don't have a GPU, so it takes a while to run. One thing to remember about this model is that it is not a dialogue model: people who are familiar with ChatGPT expect to ask a question and get an answer, but here you have to use it like code completion. You start out writing either a docstring or some comments describing what you want to do, and then the model will actually generate the code.
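To make the "use it like code completion" point concrete, here is a minimal sketch of driving SantaCoder through the Transformers library, assuming the public bigcode/santacoder checkpoint. The demo shown on stage uses its own UI; this is just an equivalent script.

```python
# Minimal sketch: SantaCoder as plain left-to-right code completion,
# assuming the public bigcode/santacoder checkpoint (needs trust_remote_code).
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "bigcode/santacoder"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)

# Not a dialogue model: seed it with a docstring/comment and let it continue.
prompt = 'def is_odd(n):\n    """Return True if n is an odd number."""\n'
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```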
In this case, what you can see is that I gave it a prompt and it generated this piece of Python, and then I think it kept going on its own. The other good thing about this model, like some of the other open source models that do infilling, such as InCoder from Facebook, is that you can actually do infilling. You can place a special token, so instead of just doing completion you can say: I think there is a bug here, or there is something missing, and I want you to fill in this part. Then, if we wait a few seconds, hopefully it will fill in the output and generate actual code. This ability to go in and do infilling is what matters for our use case. Let me go back to my slides.

When we looked at it, we wanted to apply this model to a particular downstream task, and the task we had in mind was vulnerability fixing. The most common way to do that is supervised fine-tuning. Typically, a GPT-style model is trained using causal language modeling: you have some context on the left-hand side and a position where you want to predict the next token, and you predict it, going left to right. This is in contrast to masked language modeling, where the text itself contains masked positions and you predict what each masked token is; an example of a masked language model is BERT.

So how does a causal, GPT-style model do infilling, like I just showed you? There's a technique that came out last year where people realized you can convert this infilling problem into a causal language modeling problem. You take the document and split it with special tokens: a prefix, then some code, then a suffix, and the remaining piece is the middle. You transform the text by moving the middle to the back, marked with another special token, and that's what the model learns to predict. So you use special tokens to mark, in the code, the beginning and end of each piece, and you move the middle to the end.

We used this idea to prepare a dataset for bug fixing. We have some code, and there's a buggy line in it. For this work we focused on single-line fixes, because that was simple and easy to do, but the same idea could be extended to multiple lines with the same special tokens. We then have to create a prompt for the model to consider while fixing. So we say: there's a bug, this is the CWE (these are all vulnerabilities), we format the actual buggy line, we put in the fixed version of the line, and we insert the special tokens: this is the prefix, there is something in the middle, and there is the suffix, followed by the fix. All the training data looks like this. To give you a more concrete example: this is an actual bug, from a real CVE, that we put into our corpus. This is the actual line of code that was there; we mark the bug, note the fixed line, and insert the actual fix. This whole thing is the input for training, and our final dataset consists of examples like this.
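As an illustration of that transformation, here is a sketch of how one single-line-fix training example might be assembled. The `<fim-prefix>`/`<fim-suffix>`/`<fim-middle>` token names are the ones SantaCoder uses; the comment format for the bug and CWE hint is illustrative, not necessarily the exact format of the published dataset.

```python
# Sketch: packing a single-line vulnerability fix into a FIM-style training
# string. Token names are SantaCoder's; the bug/CWE comment format here is
# illustrative, not necessarily the exact format of the published dataset.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim-prefix>", "<fim-suffix>", "<fim-middle>"

def make_example(before: str, buggy_line: str, fixed_line: str,
                 after: str, cwe: str) -> str:
    # The buggy line is kept as a comment with its CWE, so the model sees
    # what was wrong; the fixed line becomes the "middle" it learns to emit.
    prefix = before + f"# BUG: {cwe}\n# {buggy_line}\n"
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{after}{FIM_MIDDLE}{fixed_line}"

print(make_example(
    before="import hashlib\n\ndef digest(data):\n",
    buggy_line="    return hashlib.md5(data).hexdigest()",
    fixed_line="    return hashlib.sha256(data).hexdigest()",
    after="",
    cwe="CWE-328: use of a weak hash",
))
```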
You can then train it in the standard way, because it's a GPT-2-style model. There's a special fine-tuning script that BigCode has provided, or you could use the usual Transformers scripts; you just need to be careful to add these special tokens, otherwise the model will not know what they mean. So make sure the special tokens are in, and then you can run it on a single GPU very easily. We took the CVEfixes dataset, which was published last year, and extracted the single-line fixes from it; that's the dataset we trained our model on. The dataset is available, and we call the model SantaFixer. So this is the model that has been fine-tuned for fixing.

Once we had this, we realized, and this is now quite popular with the whole business of chaining and giving language models tools, that we could combine a static analyzer with a large language model to automatically detect and fix vulnerabilities. The way it works is: you have some input code; you scan it with your static analyzer; it finds some vulnerability; you use that to prompt the model and generate a fix; then you take the fix and scan it again with the same analyzer, and you repeat until you find that it's been fixed, and you keep that fix. In our work we used an open source static analyzer, and I'll show you in a minute how it works; for the LLM we used SantaFixer. But you can use basically any combination: a commercial static analyzer, another model for the generation, even OpenAI's API if you want it to generate the code. The idea is that you use this closed loop, sketched below, to generate different options and keep scanning with the analyzer until the problem is finally fixed.
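Here is a minimal sketch of that closed loop. `scan` and `suggest_fix` are hypothetical stand-ins for whichever analyzer and model you plug in; the real tool's interfaces and defaults may differ.

```python
# Minimal sketch of the scan -> prompt -> patch -> rescan loop.
# scan() and suggest_fix() are hypothetical stand-ins for the static
# analyzer and the fine-tuned model; the real tool's interfaces differ.
def auto_fix(source: str, max_attempts: int = 10) -> str | None:
    findings = scan(source)                 # run the static analyzer
    if not findings:
        return source                       # nothing to fix
    for _ in range(max_attempts):
        # Build a fill-in-the-middle prompt around the flagged line and
        # sample one candidate patch from the model.
        candidate = suggest_fix(source, findings[0])
        if not scan(candidate):             # rescan: is the finding gone?
            return candidate                # keep the first clean patch
    return None                             # give up after the budget
```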
Let me give you a quick view of how it works. The tool is open source, with instructions on how to use it. I'll just run it here, and while I talk about it, hopefully by the time I'm done it will have fixed the issue. Let me show you the Java one. This is an actual problem: there is a vulnerability here, a parameter comes in and a file is created directly from it, and that could be an issue. When you run the tool, it creates an intermediate prompt file: first it takes the vulnerability, then it generates a prompt that says: this is the vulnerability, we are not restricting the path name to a directory; this is the offending line, commented out; and then it prompts the model to infill where the bug is, using fill-in-the-middle. Eventually, once it's finished, you get something like this: it regenerates the code, and instead of using the parameter directly, it sanitizes it first, and so on.

Here is another example, which is a bit more complicated: here we're using an MD5 hash. When we scan it, the analyzer flags the use of a weak cryptographic hash, this is MD5, you shouldn't use it, and that goes into the prompt; and when you see how it eventually gets fixed, we have the fix here: it switches the hash function. So this is running now; it's actually quite nice to watch. Here's an example I recorded earlier: you can see it generates various combinations and keeps iterating until it finds a version the analyzer no longer flags. This is the example I just mentioned: the issue here is that data is coming from request parameters and being passed into the response directly. You can see it trying various combinations; I don't think it worked this time. OK, let me go back.

So we built this, and then we wanted to see how well it works in practice. We tested it on a dataset of 1,000 projects from GitHub that we scanned with the analyzer; this dataset is also something we have published. Typically you would have seen results reported as pass@1 and pass@10. Pass@1 means you take just one generated output, and with one output it is able to fix about 25% of the vulnerabilities. But if you generate 10, like the tool I showed you does by default, you can actually fix about half of them. This holds across the various languages; I didn't have the GPU capacity to go further. That's pretty much it: the model as well as the tool are open source and available, we just published them, so I invite you to try them out. That's our talk; happy to take any questions.

Question: you mentioned it runs locally and it's small; did you design it to be portable so that it runs fast? So, one thing is, I don't have a GPU, so this is running on my CPU, and it's already usable. This particular model is not too bad because, as I showed you, it's a GPT-2-style model, so it has close to 1 billion parameters. And just today, I think in the latest release of Hugging Face Transformers, there is GPTBigCode, a faster inference implementation contributed by the BigCode folks, which will speed it up quite a bit. There is other work people have done, which we have not tried, around compressing or quantizing the weights. The ideal would be to take a model like this, fine-tuned on a very specific problem like vulnerability fixing, get it down to a few hundred megabytes or so, and then put it into Visual Studio Code as a plugin. I think it's possible, but we haven't spent too much time on that. What you saw today was on CPU; on GPU it's actually fairly fast.
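On the compression point: one common route, sketched below under the assumption that the bitsandbytes integration in Transformers is available, is to load the weights in 8-bit. Proper small-footprint packaging for an editor plugin would need more than this.

```python
# Sketch: loading the model with 8-bit weights to cut memory, assuming the
# bitsandbytes integration in Transformers (and a CUDA GPU) is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "bigcode/santacoder"  # or a fine-tuned fixer checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt,
    trust_remote_code=True,
    device_map="auto",   # spread layers across available devices
    load_in_8bit=True,   # roughly 4x smaller than fp32 weights in memory
)
```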
Question: you mentioned it's a GPT-2-style model; are you training from scratch, or starting from a base model? So, the BigCode collaboration trained the SantaCoder model from scratch on The Stack, the dataset, and we fine-tuned it for a particular downstream problem. The original model, SantaCoder, is the only one I know of where you can be confident the generated code isn't encumbered by restrictive licenses.

Question: where do you see human-AI collaboration going, as far as programming language evolution; do you have any ideas? We've thought about it; can you repeat the question? So, on human-AI collaboration: I think at this point in time you can't let these tools run unsupervised. Even in the loop we're showing here, it only makes sure the static analyzer no longer flags this one issue, but you might have broken the entire codebase with that fix. So there's still a lot of human supervision required, even if you have the AI do all the work and you don't write a single line of code; you still need supervision. Personally, I don't see that going away any time soon, especially for code maintenance. For writing new boilerplate code, yes, it will be largely unsupervised, but that was never the hard problem in large organizations and large projects. For maintenance and fixing, and I think Asankhya has some ideas on this as well, you need models that are trained for that, and even then they would need supervision.

People get really freaked out about this. Every day there is some new set of tools and chains, people talking about AutoGPT, where you just tell it what to do and it keeps running. I think they're getting way ahead of themselves: all those cases you see are actually very, very simple. I'll believe it when I see a large language model install an NVIDIA driver on a new JetPack. It takes hours for me, and those who have actually done it know what it takes to make it work. I don't believe a model can set up the NVIDIA JetPack on an embedded device from scratch today; there's still a lot of room. For simple tasks, which are maybe 80% of development, take something from a database and display it on a UI, maybe that's where it's heading. But there's a lot of scope, and there are ways in which we could improve. One of the things we wanted to do: right now we don't really learn from the failures; when it fails to fix something, we don't use that signal. So one thing we could do is ask the model to explain why it's failing. But in all this talk, and this is a running joke, I'll believe it when it can install an NVIDIA driver on a machine, because it's insane when you have to do it yourself. In security I've worked with a lot of embedded systems recently, and you need a specific version of Python with a specific version of, well, the people who are laughing know, the JetPack with a specific board, and it takes two days to bring an actual device up to the point where I can run my code. So I think there's still a lot of room. What we built is for a very specific category. It's like Stack Overflow at this point; so, do I use it?
Yes, I use it, but I use it similarly to how I would use Stack Overflow: look for some information, get something out of it, and then do the work myself. The other problem is that it's hard to get the model to do what you really want if it goes down the wrong path; it keeps telling you the wrong thing. I teach at SMU, and I just finished a course where some of the problems involve heuristics and so on. We would try to ask the model, then try to nudge it in the direction we wanted, the way you would with another human; a human would take the hint and build on it. But these models just keep going down the same track, what people call hallucination, and they don't backtrack. Once the model starts on a path, every time you correct it, it says "I apologize, this is not correct" and then just spits out something that is, again, garbage. So for problems that probably haven't been solved before, or when you're trying to ask a very particular question, it's hard to get it to produce what you want. So I think, yeah, despite all this progress, there's still a long way to go.