Okay, welcome back, guys. For the next talk we have Damien Seguy. He is CTO of Exakat, a company that specializes in PHP code quality solutions for the industry, and he also leads the development of the Exakat static analysis engine, which automatically reviews your code for version compatibility, security and clean code. Today he'll be talking about teaching PHP new tricks with machine learning. I'll let Damien take the stage. Please give him a warm welcome.

So, welcome everyone. I think that after two days, three days for the most courageous of you, of learning PHP, you are going to be very interested in the only session that tells you what PHP can do instead of you. We're going to teach PHP to do things: we will not code on our own, but let PHP understand the problem and solve it for us. How good is that? Awesome. Yeah, wow. And the rest over there is sleeping. Be aware that I will be asking questions during the talk, so be ready to answer them. Yeah, I'm threatening you.

We're going to see how to teach new tricks to PHP without relying on a developer who is already a master of the problem. We're just going to give it the information, and PHP will solve the problem on its own. That will be applied to a very specific problem that we have in PHP, as in any other language, which is code in comments. You know how it goes: everyone is coding, and at some point you decide that you don't want this piece of code, or you want to debug it, so you leave it in a comment. Then it gets committed, goes to production and raises some bugs. Well, not exactly, but we're going to try to chase those comments down and remove them from a large code base, which is phpMyAdmin.

As mentioned, I'm CTO at Exakat. We're not doing machine learning on a regular basis, but we're introducing it to refine the results. With static analysis, at some point the machine is not able to go further when we look for objective issues in the code, and we use machine learning to reduce the number of false positives in our results.

So what is machine learning? I will start with the first question: who among you is a parent? Kids: one, two, three? Don't be shy. It's okay, I have some of them too, right? So, very few. Not the same kids, I guess.

Machine learning: "machine" is for the computer, which will basically be PHP here. We're going to have PHP learn something, instead of us learning what Laravel is or what web security is. We're just going to hand the task to PHP, and that's exactly the way it works.

If you think about it, the way we work with PHP is very imperative. It's a dog. Oh, okay, it's an elephant. It's a dog: we say sit, roll, sit again, fetch. We give orders, and PHP is just there to listen to the orders, take the command and do it. If it's a stupid command, PHP will do it anyway; if it's not, well, it will also do it, but then, you know, it's a smart command, right?

Now we're going to change that and take the approach of teaching a kid. We're going to show it what to do, and it's going to repeat that until we, as the teacher, are satisfied. It's more like a, you know, Shifu kind of learning. So we'll have two different phases. The first one is the training, and the second one is the application: the usage in the real world, without our supervision.

What can we do with machine learning?
I'm sure you have examples. AlphaGo, the recent Go contest: the best Go player in the world was defeated, like four to one, earlier in the year. That was built on machine learning, with exactly the same principle, although a little bit less modest than the one I'm going to show you.

What else can we do? OCR, optical character recognition. That's a very difficult, very complex task, and it means a lot of training; then, when you have it running, you want it to be fast and to recognize the characters quickly. So that's a good one. Medical diagnosis: once you realize that your doctor is going to be replaced by a robot, that's machine learning. Probably not with PHP, but anyway. If you've also seen robots walking, that's a lot of machine learning. Just like kids, you know: they learn, they see, and then at some point it works.

This is nice, although it's not going to be very applicable. I'm not going to show you how to beat the world master of Go using PHP. But what can we do in terms of application? What is very interesting in this supervised learning system is that we have a first phase where, as developers, we work on the content and make sure the machine understands and develops the right model, that it has learned correctly. When that's done, we push this model, which is a very tiny configuration file, to production, where it is applied to the web. In terms of web, it's going to be able to answer pretty fast, sometimes saving us lots of complex calculation through the database, through just pure calculations. That's where it's handy for PHP.

You have already used spam filters. Not in PHP, but that's the same idea. Or it could be recommendation systems. Think about it like e-commerce.
You have people who have selected five or six items, and from those five or six items you want to recommend a seventh, so that they will, you know, add it to their own cart. That's machine learning at its best. And we're going to do that on a very specific problem that's plaguing us: as I said, code in comments.

The situation with code in comments is that, as I said, everyone adds some comments, maybe hides some code for debugging purposes, reserved for future use, things like that, and then just forgets it. Now, if you have a small project, like it's the trend nowadays, then you'll probably be able to review five, twenty, thirty files and just clean all those comments. But if you're called phpMyAdmin and you have like 800 files, a million lines of code and historical code going back ten years, you don't want to do that. Well, we're going to try to do that.

There are about fourteen thousand comments. How many of us are there today? We're like 60 in the room. How long do you think it would take us to read all those comments and clean them? Yeah, I'm waiting for an answer. That's right. Yeah, at least an hour, right? Maybe we can do it faster than that.

This is a classic problem. It's very good for machine learning because it's a complex problem. Extracting the eigenvector of an imaginary matrix, that's going to be a piece of cake for PHP, right? Everyone knows that. On the other hand, reading comments and understanding whether the code is partial or not, whether it's PHP code or Python code or maybe broken code, that's very difficult. We usually require our own intelligence to do that and to be able to sort the comments. So that's a complex problem, with a large number of good reasons to reach a conclusion, and machine learning is going to be very good for that.

We also have a lot of expertise available. I'm sure you're all experts in comments. This is the moment you say yes.
You've all been writing comments, right? Yeah. So you know you can put lots of things in them: drawings, code, insults. Sadly enough, that's too often.

So here is the synopsis for today. As you can see, there are two branches. The first one is going to be the training. We start with data, historical data, and with our own experience. We do the training, and at the end of the first phase we end up with the model, and the model can be stored. So we stop at that point. The second part takes phpMyAdmin and all its comments, pushes them through the model, and gives us our actual results. Simple enough? Good.

So we start with that. The first element we need is an engine. We need something inside PHP that will actually process the incoming data and produce the model, and that's the FANN extension: if I remember correctly, Fast Artificial Neural Network. That library has been around for at least two decades. You're going to see that later; it shows that there's a lot of history, with behaviors that are funny to find. It has also been available to PHP since PHP 4, I think. It's working on PHP 7 thanks to Jakub Zelenka, who has been very kind to push that very fast, although apparently there are not too many people using the extension. But it works. So that's aside. Compilation: download it, install it with PECL, and it runs. Very simple. And it's going to bring neural networks to PHP.

Who's been using neural networks? One, two... only two? Well, three, counting me, I guess. Okay, so let me try again: who's been using his brain? Okay. A good half of them. I have no brain.
So a neural network is a system to implement learning that's based on actual biology. It's based on the neuron. You can count, especially for the youngest of us, about a hundred billion neurons, although no one has ever really counted them. That's basically what you're made of. A good 20% of them are used just for transmitting information: someone steps on your foot, and there are a number of neurons that bring everything to your brain. So you can count like 80 billion neurons that are available to collect everything: your history, your memories, your reactions, and the handling of all your organs' internal functions. So that's one part.

How does that work? First there is this tree-like structure, the dendrites, which collect the information. Every single little point is a connection to another neuron, or to a sensor, like for touch or for the eyes or things like that. It gets the inputs, the electrical input that is sent to the center. The heart of the neuron collects all that information together, and at some point it decides that it has to trigger something. That point is the threshold. It releases electricity that goes down the axon, much further, up to like a meter or two, and sends that to something else. So your brain decides something and you raise your foot. That's the way it works.

Now you also have to consider that there are 15,000 dendrites for one neuron. If you think about it in PHP, that's a 15,000-condition if-then. And, again to compare with phpMyAdmin, there are seven thousand conditions in phpMyAdmin, so one neuron could actually fit two phpMyAdmins. I made a slight calculation, and I think one neuron would probably have as many conditions as I have ever written in my whole career. 15,000: you can imagine how much that represents. That's huge, and you have 80 billion of them, all interconnected. Hopefully we're going to make that fit in PHP.

Anyway, we start with that. Usually neural networks, in their computer counterparts, work with three layers. There is one layer which is the input, and we're going to work on that in a moment. In between, there are the layers that are just for internal calculation, for which we will not know exactly what happens, but the training will set everything inside. We just decide how many layers we want, so the blue lines, and how many neurons will be there. And finally we have one output. Here we will just decide whether the comment is code or not, so just true or false. That could be a spatial position, that could be X, Y, Z coordinates, so you can actually have complex output, categorization, things like that.

I've been talking a lot, so let's take a look at actual code. Given the structure of a neural network, here is the first part of your code. You can see there is fann_create_standard. This is really funny, because the FANN extension has really long names, so that's always horrible. But we give it all the information I mentioned: the number of layers, the number of input neurons, of output neurons, and of hidden ones. We just give that, and then FANN is going to take care of the rest.

The other thing, it's from the docs, and it's going to make your code complete: this is the threshold function. As I mentioned, the heart of the neuron filters part of the incoming electrical input, and this is the one that is most used, sigmoid symmetric. I won't dwell on that.

Now we have the engine. Now we have our very little brain. We need our first data, right? So let's start with that.
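The layered structure just described can be sketched in a few lines. This is a minimal illustration in Python rather than PHP, and it does not reproduce what the FANN extension does internally: the layer sizes (5 inputs, 3 hidden neurons, 1 output), the random starting weights, the steepness value and every function name here are assumptions made for the example. The only idea taken from the talk is that each neuron sums its weighted inputs and passes the sum through a symmetric sigmoid.

```python
import math
import random

def sigmoid_symmetric(x, steepness=0.5):
    """Symmetric sigmoid: squashes any input into the open range (-1, 1)."""
    return math.tanh(steepness * x)

def make_layer(n_inputs, n_neurons, rng):
    """One weight row per neuron; the last weight in each row is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
            for _ in range(n_neurons)]

def run_layer(layer, inputs):
    """Each neuron sums its weighted inputs plus bias, then applies the threshold function."""
    outputs = []
    for weights in layer:
        total = weights[-1] + sum(w * x for w, x in zip(weights, inputs))
        outputs.append(sigmoid_symmetric(total))
    return outputs

def run_network(layers, inputs):
    """Feed the inputs through every layer in order."""
    for layer in layers:
        inputs = run_layer(layer, inputs)
    return inputs

# Hypothetical sizes: 5 input values, 3 hidden neurons, 1 output neuron.
rng = random.Random(0)
sizes = [5, 3, 1]
network = [make_layer(sizes[i], sizes[i + 1], rng) for i in range(len(sizes) - 1)]

# An untrained network still answers; the answer is just meaningless until training.
output = run_network(network, [2, 1, 3, 0, 40])
```

Training is what turns those random weights into something useful; the structure itself stays this small.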
We're going to start from PHP. That's going to be our source, and that's usually how it happens, right? We're not going to get the comments out of the blue from phpMyAdmin. They have no idea what's in there, or that we're doing this analysis tonight. So we start from the code.

The first thing we do is the extraction, which is looking into this, you know, unstructured data and extracting the information that is interesting for us. Anyone know how to extract comments from PHP code? You've raised your hand too many times. Okay. Basically, we use the tokenizer to turn the code into tokens. There are three kinds of tokens that store comments: single line, multi-line and PHPDoc. So that's the part of the process that is completely objective. We give the code to PHP, PHP breaks it down into tokens, we filter the ones that are comments, the rest is out, we don't need it, and we keep them. Simple.

Now, the second part. We have this raw data, but we can refine it. We're not going to train our neural network on everything, because there are a number of things we already know for sure are not going to be useful. We mentioned single line, multi-line and PHPDoc comments: we can drop PHPDoc. They may contain code, but that's probably not code that has been, you know, shoved aside to let the rest of the implementation run. So again, from the raw data you remove everything you can that is easy to remove.

Second, we also do a human review. What is a human review? It means that on single line or multi-line comments, we can look at them and understand immediately that this is not going to be code. We're going to see a few examples, but that's an important thing: we don't just extract raw data to give it to the system, we remove the comments that are completely useless, and we end up with that.
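The extraction and the first filtering step can be sketched as follows. This is Python, not the PHP tokenizer used in the talk, and it swaps in a regular expression for real tokenizing, so it is deliberately naive: it will mistake `//` or `#` inside a string literal (a URL, for instance) for a comment. The function names and the sample source are hypothetical.

```python
import re

# Naive stand-in for a real tokenizer: block comments, then // and # line comments.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*|#[^\n]*", re.DOTALL)

def extract_comments(php_source):
    """Return every comment found in the source, in order of appearance."""
    return [match.group(0) for match in COMMENT_RE.finditer(php_source)]

def drop_docblocks(comments):
    """Discard PHPDoc blocks (opened by /** plus whitespace); keep the other comments."""
    return [c for c in comments if not re.match(r"/\*\*\s", c)]

sample = """<?php
/**
 * Adds two numbers.
 */
function add($a, $b) {
    // $debug = true;
    return $a + $b; /* var_dump($a); */
}
"""

kept = drop_docblocks(extract_comments(sample))
# kept == ["// $debug = true;", "/* var_dump($a); */"]
```

The human review described above would then run on `kept`, removing whatever is obviously useless before anything is tagged.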
That's ready for FANN. Now, the thing is, we have a number of comments. What do we need to do as experts? We have to tag them as code or not. How difficult is that? Difficult? Not difficult? Okay, let's give it a try.

Here is a little list of things we're going to extract to do the training on. What do you think of the second comment? Can someone tell me if they think it's code or it's not code? The second comment. No, the third. "Some code is inside"? I don't want that. I want yes or no. "Some code", "maybe", "tomorrow": those are answers I don't want to hear. No. That's good. Actually, it's easy, because the red thing means that I have already done the sorting for you, right? Yeah, you're learning. That's great. I see that more people than that have a brain.

At the bottom, those are not code. On the top, on the other hand, we have things that are obviously code, right? Even if it's partial, it's code. In between, let's do a survey. The ones with, you know, the gradient. The first one, "a and b and multi-dimensional": who thinks this is valid PHP code? One, two, three. Okay, that's good. I just want yes or no, but you can decide not to vote. Now, who decides this is not PHP code? And we have more people, right?
That's more like 20. Okay. This is the moment where your expertise is needed. Those three comments do not have an explicit answer. We may decide differently depending on our sensibility, depending on how well we know the code. Maybe this is actually something that could be found: mixing two different kinds of logical operators is difficult, but that happens, right? This is the moment where we need the expertise. Everything that's objective, that we can decide easily and agree on, is probably not the most interesting. This is what makes the difference between different experts and their opinions. It means that you can do this expertise on your own, or maybe with other people, but just know that there's not always a good answer.

Now we're going to have another part. We have those comments, but we need to input numbers into FANN, because it's not going to read English, or PHP for that matter. I would like you to think a little bit and give me things that you would look for in those comments, things that would be characteristic of code or not.

Let's say, for example, we take a look at that: variables. This is very typical of PHP. So if we find a structure which has a dollar and a few letters right after it, this is something that's typical of PHP code. Note that this is not necessarily always code, right? The second one, as was mentioned by I don't remember who, is "some code is inside". Okay, but this is something that's usually characteristic.

Can you give me other ideas of things that could be characteristic and help us decide if it's code or not? You can either shout out your answers or head to the mic at the back. Semicolon at the end. Yeah, that's another one. There are one, two, three here.
That's a good one. Other ideas? Keywords, like "if", "equal". Okay, so PHP keywords: the classic conditions, while, do-while, things like that. All of that could be typical of PHP. Functions: die, dl, var_dump, things like that. What else?

Highlight? What do you mean, like syntax highlighting? No, no, no, we're talking about plain text, not IDE stuff. So no, and highlighting would be the same as the preceding one: keywords. You know, you make a list of keywords, and those are the ones that will be helpful as characteristics.

Other ideas? Brackets, and curly braces. Yeah, that's usually useful. First raise your hand, because otherwise I don't know who to look at. Operators. Yeah, I would put that in the same category as keywords, so that's a good variation. Good. Strings, quotes maybe? Is that what you're mentioning, the quotation marks? Well, all of them are used, and I think they're also used in natural language. That could help.

Another one, finally, over there. Yeah? Patterns. Could you be more precise? Because "pattern" is kind of general; that's what we're looking for. Yeah, so you want to make a regular expression for something like an assignment: a dollar variable, an equals sign, and so on. A little more complex, I guess, but that could work. That could be useful. Thank you for sharing.

We're going to stop at some point and we have to go on. The list of characteristics we're going to use is short here, because I want it to fit on the slide, but it's completely mine, and that's a first try. Operators, that was mentioned. There is one that's not my choice, but that works too. Semicolons: I just count them. I don't care if they are inside a string or outside, or whatever.
I just count them because it's not such a useful Punctuation sign in in English A number of equal the dollar, but I just count dollar I do not even take care of variables just dollars and the size the size the sheer size of the comments Most often people do not like to write long comments to just show outside a piece of code So if it's long, maybe it's a full function or something like that This list is completely arbitrary at that point, right? So keeping my all your ideas you'll have time to make them better to add that make that better To make that better what's important here is from the text we have extracted We can now make very simple searches and convert the comments into numbers and that's exactly what fun needs So we're not going to spend too much time here. This is the ugly file format that Fun requires so basically from the comments We need to turn everything all the characteristics into a number So there is a first line of error very C style the number of Of training a set the number of incoming data, which are the number of columns the number of outgoing data Which means that we have sets 45 set of two lines each line is a set of five number and a set of one number And that's repeated you have an example on the side At that point and before we start actually do the training I would like just to To share a feeling I have because you can imagine that with we're trying to look for code in comments Okay, so something that's kind of structure. We can understand. What did we do? We got some text We counted characters to turn that into numbers This is the input for fun fun read those numbers if that was a count of carrots If that was a number of kilometers that was you know traveled by a snail I don't know. 
I don't care, but FANN is still going to output something. It means that the expertise part, the part where we turn the code into numbers, is really important. This is the part that gives the meaning. In between, the machine is completely blind. It has absolutely no idea what we're doing. It just depends on us to apply that to some reality. Kind of black magic, I think, but anyway, it works.

To complete the first script, here is the training. Remember, I introduced two functions; this is the third one. We have to provide the initial resource, we have to provide the data, and we have to provide the desired error. That's cool, right? Who has ever done that? You know: here's some data, that's all, I want no error, do some work. That's nice.

So we start with a desired error of 0.001, which is like one thousandth. What do we want to do? If we start with a very high error, FANN is not going to train at all; it's just going to do whatever it wants until the number of errors is within the margin we asked for. So if we make the error way too high, it's going to train very fast. We give it a very light load, it trains something and makes a lot of errors, but that's what we asked for. So what can we do? Of course, we can lower the error and raise the challenge, and FANN is probably going to work a little more, make the model a little more complex, and remove a number of errors. Then at some point we can say: hey, we're clever, we want it to train exactly on the data we have. FANN is probably going to succeed, but suddenly the result is going to be really weird and probably overfitted to our training. Remember, we are training it on a number of comments, and it will meet something else. So there is a level of error that you have to choose, and it takes some experience to do that. As for me, I just started with the values from the documentation and it worked very well.
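The role of the desired error can be shown with a deliberately tiny stand-in trainer. This is not FANN's training algorithm: it is a single sigmoid neuron trained by plain gradient descent in Python, and the data, learning rate, epoch limit and the two error targets are all made up for the illustration. What it demonstrates is only the stopping rule: training ends as soon as the mean squared error drops to the desired level, so a looser target never needs more passes over the data than a stricter one.

```python
import math

def train(samples, desired_error, max_epochs=10000, learning_rate=0.5):
    """Train one sigmoid neuron; stop once mean squared error <= desired_error.

    Returns (weights, epochs_used, last_error). The last weight is the bias.
    """
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        squared = 0.0
        for inputs, target in samples:
            total = weights[-1] + sum(w * x for w, x in zip(weights, inputs))
            output = 1.0 / (1.0 + math.exp(-total))
            # Gradient step for squared error through the sigmoid.
            delta = (target - output) * output * (1.0 - output)
            for i, x in enumerate(inputs):
                weights[i] += learning_rate * delta * x
            weights[-1] += learning_rate * delta
            squared += (target - output) ** 2
        error = squared / len(samples)
        if error <= desired_error:
            return weights, epoch, error
    return weights, max_epochs, error

# Hypothetical data: [semicolons, dollars] -> 1 if tagged as code, else 0.
samples = [([3, 2], 1), ([2, 3], 1), ([0, 0], 0), ([0, 1], 0)]

_, loose_epochs, loose_error = train(samples, desired_error=0.1)
_, strict_epochs, strict_error = train(samples, desired_error=0.01)
```

Pushing the target toward zero on a handful of samples is exactly the overfitting trap mentioned above: the model ends up matching the training comments rather than comments in general.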
So that was nice. Now, training on a real case. Here are the actual values I worked with to train for the phpMyAdmin analysis: 47 cases, not too many, five incoming characteristics, and a configuration of three hidden neurons, five inputs, one output. Six seconds of training. We can do that, right? That's nice. So we're done. We get a file which is a few kilobytes. Don't try to read it, it's unreadable, but we can reuse it and apply it to real data.

The application of the model is just as simple as the code used previously. We again create another resource. We create the vector: even for the real case, we need to extract the comments and build the same characteristics as the ones we used for training, of course. Otherwise the system is not going to recognize anything. Then we just call fann_run and we get the results in an array. How cool is that? Except for the long names of the FANN functions, no difficulty there, right?

And to be true, we really want results that are above 0.8. Anyone have an idea what that means, 0.8? Obviously everyone has an idea. I have no idea what that is. It could be a score. It could be carrots again. I have no idea. We have inputted a number of counts: we have counted dollars, we have counted characters, we have counted operators, and we get a number which is a composition of all of that. I have no idea what that means. Is it a percentage? Is it something else? And besides, we said we train between zero and one, true or false, and we get results ranging from minus 15 to one. I don't know how FANN found that. Apparently closer to one is good and below zero is bad, so we decide to have a threshold.

Over the 14,000 comments that are available in phpMyAdmin, here is their distribution. As I said, it starts at minus 15 and goes up to one. When we get close to the end, this is the way the distribution goes: you see, from 60 to 70 it's very, very steep, then suddenly there is a little flat, and then again it goes up like a stair. It means that at 80 we could actually be missing quite a large number of results. Again, this is completely arbitrary. It's up to you to take a look at the way the results are distributed and decide exactly where to put the 0.8. That could be a little lower, that could be a little higher.

Anyway, the results on phpMyAdmin, on those 14,000 comments: script execution to analyze all of them is 68 milliseconds. We said: there are like 60 of us, we work an hour, and we have no results. Here, I cannot even lift my finger from my keyboard and it's done. And we have about 14% of issues.

Now, a first count. Who thinks that's a lot? Who thinks that's too much, that we're finding too many comments? Okay. Who thinks that's about the number of comments they would expect? Who thinks it should be lower than that? Okay, so three people more, and the rest is sleeping.

Here are some of the results. Can you imagine: 27 minutes of work, and we actually spot lots of code. Among the 2,000, we have those. And we also find those. That was too fast. We also find those.

Now, as I said, I introduced you to five functions from FANN. That was very simple code. Where is the bug? How come I find errors here? Do you think there's a bug in my code? I'm looking at you guys. No, this is no bug. Welcome to a world where there are no more bugs.
This is the first script you see that has no bugs, only false positives. False positives are things that the machine finds but that we do not agree with. Everything that is positive is something that FANN found. Everything that is true is something that we agree with. The true positives are the ones we want: FANN found it and we agree with it. That's the best. The other one you also like is the true negative: FANN did not find it, and we agree that it is something that should not be found. What we don't want is the false positive, which is the most often talked about case. And the false negative is the one no one wants to hear about, because FANN didn't find it and we don't want to do FANN's work and check what it has not found, so usually people don't look for those. I think I have to be done, so let's skip that.

In terms of results, if we analyze and review the 2,000 of them, about 50 percent were very simple and repetitive strings. Remember when I told you initially about everything that's obvious and should not be tested or trained on here? Like a Vim configuration line. That's basically in every single file, because Marc is using Vim to produce phpMyAdmin. Well, we can spot that and remove it. There's also a large number of paper configurations for some PDF printer that we could remove very easily too. That will reduce the training and bring the false positives down to 800 at least, and then we can start again, do our own work and review them. The total time to do that, from the beginning until now, would be 27 minutes, excluding the compilation of FANN.
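The four outcomes can be counted mechanically once a threshold is chosen and a human has reviewed the same comments. The sketch below is Python, uses the standard definitions (positive means the model flagged the comment, true means the reviewer agrees with the model), and runs on invented scores; only the 0.8 threshold comes from the talk.

```python
def confusion_counts(scores, human_says_code, threshold=0.8):
    """Compare thresholded model scores with a human review of the same comments."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for score, is_code in zip(scores, human_says_code):
        flagged = score > threshold  # "positive": the model says this comment is code
        if flagged and is_code:
            counts["true_positive"] += 1
        elif flagged and not is_code:
            counts["false_positive"] += 1
        elif not flagged and not is_code:
            counts["true_negative"] += 1
        else:
            counts["false_negative"] += 1
    return counts

# Made-up scores, in the "somewhere below zero up to one" range seen in the talk.
scores = [0.97, 0.85, 0.40, -3.2, 0.81, 0.10]
reviewed = [True, False, False, False, True, True]

counts = confusion_counts(scores, reviewed)
# counts == {"true_positive": 2, "false_positive": 1,
#            "true_negative": 2, "false_negative": 1}
```

Moving the threshold down trades false negatives for false positives, which is why looking at the score distribution before fixing 0.8 matters.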
I will leave that to you. But it also means that reducing the amount of work from 14,000 comments to read down to 800 to check, with a really interesting level of true positives, takes 27 minutes, say 30 minutes. So that's interesting, and it could be applied to many situations. You can give it a little try: half an hour, an hour, get some data, train on it, get some results. Is it interesting enough? Say 50% is already removed and we can just reduce the work we do: okay, that's interesting. Or you try it and it doesn't give any results, or there's a really bad level of false positives: then just ditch it and finish by hand.

So, to leave some room for the next speaker, I will let you read online the other slides, which are about how to make what we've seen better. The most important point, and this is why machine learning is often associated with big data, is this: if you want it to be better, well, we started with 47 comments. More data, more training. The more you have, the better the training will be and the more precise it will be. So in any case, that's your first optimization: get more data, get other applications, review them, really check the results, put them back in, and train again. Training was five seconds, so iteration should be really fast.

This is the evolution of the training time depending on the number of neurons and layers you have inside. So even if you want to make it really complex, it's still very fast. On the other hand, results always come very, very fast, but they may vary a lot. Again, I was so clever.
I mean, so clever... yeah, I was clever to read the doc, but I was lucky: one layer, three neurons. If you look, this is almost the best. I could look a little further, and maybe with five neurons it would be even better, but still with one layer. Otherwise, the results can vary from what we actually found to almost everything. The whole training and testing here: how many is that, like 50 configurations? Five seconds by 50, it's ten minutes, and you have this graph. Ten minutes and you have this graph, so you can pick exactly the right configuration and make it run for you.

I'll finish with that. If you want other tools: FANN is nice because it's in PHP, so you can use it on the production server, but it's really do-it-yourself with the data, with nothing to help you. So that's nice, but there are other tools that are more interesting and that will bring you further into the world of machine learning. Give it a try. And that will be it. Thank you.

Thank you, Damien. Does anyone have any questions for Damien?

This is your time to fight back and ask me questions, after the number of them I asked.

Yeah, of course. The retroaction is what I've shown very briefly, in the feedback, in the "making it better" section. You do a first run, you realize that you have a number of false positives, you take them and put them back in training. Maybe at the same time you also take some true positives into the main training, go from 47 to 100 lines, train and run again. At some point you'll probably get the feeling that you've reached a plateau in your progression, and then you can stop. Okay, but yes, retroaction.
That's the easy one. There are also learning techniques that directly include the retroaction, which are not presented here.

Yes. Inside the static analysis, at some point, as I said, analyzing the code and trying to guess whether there is a problem is good, and I try to do it as objectively as possible. But at some point I usually have to make some wild guesses, and I try to push that to machine learning. So once I've done whatever I can do and can argue for, I stop, and the rest of the results I handle by experience: I go into the code myself, reach a conclusion, and train a little model on that. So that's my main case.

The other one I mentioned is filtering out data, and that could even be for identification. I'm not telling you to run your identification, your login system, on that. But if you have something that fast, which saves you from going to the database to detect something that is fraudulently trying to connect, then you're saving a lot of resources. If it passes this first filter, which has some error level but is okay, then you can go to the database, and you save the database maybe 80 percent of the hits. That's interesting, and that's exactly the way I use it: try to filter out some noise, things that you cannot remove and for which you have difficulty finding objective filters. Use some training like that and you finish with something that's really powerful.

For ETAs: when I have scripts that run, for example, I never know exactly how long one analysis is going to take. I have 200 of them, and the structure of the code being analyzed may have a different impact on my code. So I also try to use it for that. Any time I run the analysis, I log the time, and I use the time and a few characteristics of the code to train on the data and get an estimation. Compared with just counting the number of tasks, that's usually more precise, and no one really cares if it's a bit erroneous or not.

Yes. Yeah, you're going to have a lot of fun with that. You have a big database with lots of profiles and information and characteristics, and you want to link those characteristics with the fact that they buy, or they go to this section, or they buy this product. Yeah, help yourself. That's going to be fun.

With FANN: trial and error, trial and error. Whatever you give it, you try: if it's better, you keep it; if it's worse, you just remove it. At that point we want to bypass human thinking, to be able either to explore things that we do not think about or to remove options that are not interesting. So don't break a sweat too much. Think about it, but don't overthink it. Just: okay, here are a few options, I give it a try. As I said, it's very fast to get feedback. So I remove one. Is it better? Yes? Okay, I just remove it, and I try to put in some other. Retroaction.

Now, this is also very rudimentary. It's really good for a session like this, so you understand the different phases of machine learning. There are also tools, IDEs, that will do this work for you and give you some certainty, or some efficiency, for every characteristic. This is not the case here.

Yeah, yeah, the other ones.
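The trial-and-error selection of characteristics described a moment ago (remove one, check whether the result is better, keep the change or undo it) could be sketched as below. This is a minimal Python illustration, not the speaker's code: `prune_characteristics`, `toy_score` and the characteristic names are hypothetical, and `toy_score` merely stands in for "train on this subset and measure the results". Treating a tie as "better" (so that fewer characteristics win) is also an assumption of the sketch.

```python
def prune_characteristics(names, score):
    """Try removing each characteristic once; keep the removal if the score
    does not get worse. `score` takes a list of names and returns a number
    (higher is better)."""
    kept = list(names)
    best = score(kept)
    for name in names:
        if len(kept) == 1:  # always keep at least one characteristic
            break
        trial = [n for n in kept if n != name]
        trial_score = score(trial)
        if trial_score >= best:  # not worse: drop it for good
            kept, best = trial, trial_score
    return kept, best


# Hypothetical setup: two characteristics carry signal, the others are noise.
USEFUL = {"semicolons", "dollar_signs"}


def toy_score(subset):
    """Stand-in for training and testing a model on `subset`."""
    useful = sum(1.0 for n in subset if n in USEFUL)
    noise = sum(1 for n in subset if n not in USEFUL)
    return useful - 0.1 * noise


kept, best = prune_characteristics(
    ["semicolons", "length", "dollar_signs", "uppercase"], toy_score
)
print(kept, best)  # ['semicolons', 'dollar_signs'] 2.0
```

With training runs that take a few seconds each, a loop like this stays cheap even when every call to `score` retrains the model.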
Yeah, no. Between you and me, and I don't want to make a big thing of it: I didn't check them. So, as you said, the 40% that was detected by FANN, that's the part I reviewed, because those are the true positives or the true negatives. The others I just didn't check, because this is training at that point. I don't want to review the 14,000 to know how good it is. I can also open the rest and do a quick check myself. You know, don't read all of them; just try to detect whether things that are obvious have been missed. Otherwise, at that point, yeah, you either have to trust your system at the training level or review it. But yeah, you should be testing the two of them until you're satisfied with the level of certainty.

Any other questions? No? Thank you. Thank you, Damien. We're coming to the close of