Cool. We're going to start. Okay. So, next up, I'm very happy to introduce two of my friends that are going to be talking about how they tried to use machine learning, and maybe it worked. Maybe. So, Rahul, plus we're calling Chris. Hi, everyone. So, I'm Chris. This is Holden. We did some work on trying to predict PR comments with GitHub, and we're going to be talking about it today. So, we put this up earlier, but this is for the stream for folks at home: if you want to go to github.com/frank-the-unicorn/fosdem, it's a repository, and you can open up a pull request there with any old open source code you want, and we'll actually iterate on that and process it live today, and we can talk about the machine learning we're doing behind the scenes. So, if you want to play along at home, feel free to open up a PR there. A little bit about me: I contribute to a lot of open source, primarily Kubernetes and a lot of stuff in the Go community. I've written a book and I just engineer a lot of things, I guess. I don't know. Anyway, I helped Holden here with the Kubernetes bit of things as we started to go down the machine learning path, and I work for VMware. Cool. I'm Holden. I contribute to other projects, mostly in Scala. I'm an author, but of different books, and I work on different projects, and I work for a different company. Thank you, glorious employer, for paying my salary. Despite all of this being mountains, I don't climb mountains, and also that's not my Twitter handle at the bottom. This is Nova's slide template. Yeah. I climb a lot of mountains, and this is my slide template. So, we're going to look at mountains too today. I hope that's okay. Okay. So, this all started four months ago, three months ago.
We got together and we were brainstorming ideas for machine learning, and Holden had the idea that we might be able to scrape data from GitHub and try to predict the lines of a pull request where people might make a GitHub comment that says, hey, maybe this isn't the best idea, or something like that. The more we thought about it, the more we realized that GitHub stores a lot of information about developer habits. These are things that people commonly do, and then there are people who are reviewing pull requests all the time, like Holden with Spark. There might be some patterns behind the cognitive behavior there, and we were wondering if we could train a model to represent that. So, we thought maybe we could learn about what normally gets attention from humans in GitHub. Maybe we could use our skills in Kubernetes and Kubeflow and machine learning to build a pipeline to predict where these pull request comments are most likely to show up. Yeah. So, our very high level, super high quality design document, as you can see here, is step one, extract the data; step two, build the model; step three, serve the model; and then step four, just in case step three resulted in garbage, is to collect explicit and implicit feedback. So, the idea is that you can thumbs up and thumbs down pull request comments, and so the bot will make comments and we can record the interactions. Also, we can do sentiment analysis, and if people start swearing at Frank the Unicorn a bunch, maybe Frank the Unicorn is not the greatest unicorn. He'd be very sad. Also, we don't use Kubeflow, because I couldn't get my PR to add Spark to Kubeflow integrated in time, but I really wanted to. I have a PR and it'll happen next week. Cool. Okay. So, we started to look at building this thing out.
So, the first thing I was tasked with was building some data extraction from BigQuery, and I wrote a program in Go, that we can look at later, that you can run not only as a single cron job to export a year's worth of data, but that we also have running in Kubernetes as a Kubernetes operator that will go and audit BigQuery, see if anything has changed, and automatically update the records behind the scenes if it has. So, we have this resilient, ongoing update of data coming out of GitHub, as well as the ability to backfill back to 2014, although 2014 wasn't the best year for GitHub data. So, 2015 moving forward is really as far as we got. Then, the BigQuery side of things, this is where Holden really took over. Yeah. So, I utilized my wonderful skills of writing SQL, which I pretend not to have most of the time to avoid writing SQL queries, but I needed some data, unfortunately. So, yeah. The SQL query wasn't so bad. The main thing is that GitHub changed their format a bunch of times, and the BigQuery table that represents the GitHub data represents it in the format of the time the event occurred rather than the current format. So, it's an exciting opportunity to write a bunch of if-else conditions in your SQL expression, which is just a really great way to spend your Sunday afternoon. So, after that exciting joy of writing very imperative-style SQL somehow, I started to build the brain for Frank the Unicorn. The brain is built in Scala with Spark. This was a mediocre choice, but informed by the fact that I work on Spark for my day job and I could probably get away with calling that work, if anyone asked too closely. It had some good benefits. It allowed us to train our models in parallel and do cool things there, and it also let us do a bunch of data filtering things there too. Once we built the brain, unfortunately, I remembered why I don't build machine learning models in Spark very often, and it's because serving them really sucks.
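The schema-drift problem Holden describes is easiest to see in code. This is a minimal sketch, not the actual query: the struct and field names here are hypothetical stand-ins for the real GitHub Archive columns, and in the real pipeline the equivalent if-else cascade lives in CASE WHEN expressions inside the BigQuery SQL rather than in Go.

```go
package main

// rawEvent stands in for one row of the GitHub event data in BigQuery.
// The field names are hypothetical; the real table stores each event in
// whatever schema GitHub used at the time the event happened.
type rawEvent struct {
	CommentText string // where a newer event format keeps the comment body
	Body        string // where an older format kept it instead
}

// commentBody normalizes the comment text across format eras — the same
// kind of if-else cascade the SQL query needs (CASE WHEN ... in BigQuery).
func commentBody(e rawEvent) string {
	if e.CommentText != "" {
		return e.CommentText // newer format wins when present
	}
	return e.Body // fall back to the older format's field
}
```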
So, sadly, the serving layer is written in Scala and Spark, and we spent the last six hours getting Scala, Spark, and gRPC to all not fight with each other too much. That was just a really great way to spend a Saturday night. Yeah. So, after we had the Scala and Spark side of things set up, we needed a way to plug that in to GitHub, and that's where the Go came out again, and I wrote another Go program, the second one, and we interface with Holden's Scala over gRPC. Her Scala is the server and the Go is the client. The Go program also serves as an HTTPS server that the GitHub Events API sends a webhook post request to whenever there's a pull request event. So, if you were to open up a PR, the Go program that's listening on HTTPS will get a big blob of data about your PR pushed to it, and then that'll go and open up a connection to Holden's Scala side of the fence and start to iterate and do the back-end processing. If all goes well, that's going to come back to the Go program, and then that Go program is going to go leave a comment on your PR. Yeah. So, that's the high level of how things work. We do have a diagram. So, this is how it all fits together, and I'll walk folks through this. It looks really complicated, but it's really not that bad. So, the BigQuery data comes into the extract operator that runs continuously. All of that gets pushed up to Google Cloud Storage as CSV files, and that's an atomic transaction whenever that goes in. So, whatever is reading from it can be guaranteed that it's a complete dataset.
Then we go into the Spark Scala training bit, and then there's the Scala gRPC server, and then we use the domain fabulous.f, using Kubernetes ingress to serve a public HTTPS endpoint that the GitHub Events API pushes to, and all of that kick-starts down here by installing Frank the Unicorn on a GitHub repo. Then, if you open up a pull request, it kicks off the other chain of events that will eventually circle back around to the pull request at the beginning. Okay. So, I'm going to talk a little bit about the components and what we learned as we were building them. So, as I was building the data extractor, we really wanted it to run in Kubernetes, and we wanted to make sure it could go and check new records. I think the big takeaway there was that we didn't really gain much value running it as an operator in Kubernetes. It sounded like a good idea at the time, because we thought we would be getting meaningful data on a day-to-day basis, but it looks like we probably got a lot more just by backfilling over the last couple of years. Furthermore, as we were going through the GitHub data, we realized that it really wasn't clean at all. So, the majority of the Go program just turned into data sanitization, checking values and making sure that it was all going to fit nicely together. Then, the atomic part of things was important as well, and Go was pretty okay at making this happen, because we could use a mutex to make sure that if we were writing a CSV file, it wouldn't be read before the file was done being written. Let's see what's next. Oh, the Suggester, this one was fun. So, I got to write an in-memory concurrent queue, because we had two main parts of the program happening at the same time. The first one was the HTTPS server, and then concurrently we were running the gRPC client that was talking to Scala and Spark.
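A minimal sketch of that in-memory concurrent queue, with strings as a stand-in for the real PR payloads. A buffered channel would also do the job; this mirrors the hand-rolled mutex style described in the talk.

```go
package main

import "sync"

// workQueue is a minimal sketch of the in-memory concurrent queue: the HTTPS
// handler pushes incoming PR events, and the gRPC client loop pops them off
// in the background.
type workQueue struct {
	mu    sync.Mutex
	items []string // stand-in for the real PR payloads
}

// Push appends an item to the back of the queue.
func (q *workQueue) Push(item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, item)
}

// Pop removes and returns the oldest item; ok is false when the queue is empty.
func (q *workQueue) Pop() (item string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	item = q.items[0]
	q.items = q.items[1:]
	return item, true
}
```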
So, the way that this worked is the HTTPS server would get a post request and stick it in a queue, and then concurrently, in the same process, the gRPC client would go and pop something off the top of that queue and then go process it in the background. So, we're seeing this really interesting concurrent double server-client pattern that Go allowed us to do; that was exciting. Furthermore, we had the patch view for comments. This was what we learned about the GitHub API: if you want to leave a comment on a PR, it's not as simple as saying leave a comment on line 12. You actually have to go through and calculate, based on the patch view, how many lines down from the previous change you want to leave a comment. So, there's a little bit of math that we had to do for every one of these. Furthermore, we used Contour for ingress, because we had to serve this whole thing publicly or GitHub wasn't going to send us any events. So, that was exciting, to get to use Kubernetes ingress to solve that problem. Cool. And so, to make this all work, we needed to train a model. Otherwise, it wasn't going to be very useful. So, we trained this with Spark on Kubernetes. There's a whole bunch of different kinds of classification models built into Spark. I tried a bunch of them under the principle of, why not? And for the most part, they all performed similarly poorly. Gradient boosted trees performed a little better than most of them, and so that's the one that we just went with. In the first iteration of this, we performed a little better than guessing, but not a lot. That was really depressing. We added some more features and we got more better, but not much more better. And so, this was kind of a sad start to the project from my point of view, where I was like, oh, I have a model, but it's garbage. Okay, right. And if anyone here has run Spark, you know that part of running Spark is collecting out-of-memory exceptions.
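The patch-view math can be sketched like this. It's our reading of GitHub's position rules for review comments (the line just below the first @@ hunk header is position 1, and later hunk headers still count toward the position), not Frank's actual implementation, and the helper name is made up for illustration.

```go
package main

import (
	"strconv"
	"strings"
)

// patchPosition converts a file line number (on the new side of the diff)
// into the "position" offset the GitHub review comment API wants: a count of
// lines down the patch text, not a file line number. Returns -1 when the
// target line isn't part of the diff.
func patchPosition(patch string, targetLine int) int {
	position := 0
	newLine := 0
	firstHunk := true
	inHunk := false
	for _, l := range strings.Split(patch, "\n") {
		if strings.HasPrefix(l, "@@") {
			// Hunk header like "@@ -1,2 +1,3 @@": read the new-side start.
			plus := strings.TrimPrefix(strings.Fields(l)[2], "+")
			start, _ := strconv.Atoi(strings.Split(plus, ",")[0])
			newLine = start - 1
			inHunk = true
			if firstHunk {
				firstHunk = false
			} else {
				position++ // later hunk headers still count toward position
			}
			continue
		}
		if !inHunk {
			continue
		}
		position++
		if strings.HasPrefix(l, "-") {
			continue // deleted lines don't advance the new side
		}
		newLine++
		if newLine == targetLine {
			return position
		}
	}
	return -1
}
```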
They're like Pokemon, except they're the really crappy Pokemon, because they just show up everywhere. Our favorite OOM that we collected was the container OOM kill. And this is because it happened often and there were no logs. So, unless you were watching the pod status, you'd just be like, why are my executors disappearing? There's no log messages. And that took me, I mean, that took me several days of trying and then eventually asking Nova. Yeah, and then I got on the phone and fixed it in like half an hour. Yeah, okay, fine. So, and that was good. And then the rest of them were sort of the standard OOMs that we get with Spark. And I'm well aware of, maybe not happy with, JVM worker, whoops, heap-based out-of-memory exceptions, but like, that's just my life. That's my jam. The driver ran out of memory a bunch of times with some of the models during training, and this is because we essentially used the driver as a parameter server during training, and for some of the models that was just too much memory, because it wasn't cleaning up very well. And that was kind of sad, because we weren't training on huge stuff. So it was a little depressing, but it's okay. Okay, so we used a bunch of different features. Word2Vec was the one I started with, because of the source{d} tech blog post about id2vec encodings. I was like, that sounds like a cool encoding, maybe I can do this. And then I did, and then the features, I mean, I'm sure they're good features, but they weren't good for this. But it was our starting point. I also tried TF-IDF, which is a standard document retrieval thing; not super useful. And then it started to become things that look suspiciously similar to what a linter might be looking at, just fed in as features to a decision tree. So: lines that are all spaces, the percentage of spaces, what language it's being written in.
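A few of those linter-ish features are simple to compute on a per-line basis. This sketch is illustrative, not the real feature extractor; the actual pipeline layers the language, issue overlap, and embedding features on top.

```go
package main

// lineFeatures computes a few of the linter-ish per-line features mentioned
// above: whether the line is entirely whitespace, what fraction of it is
// whitespace, and its length in characters.
func lineFeatures(line string) (allSpace bool, spaceFrac float64, length int) {
	rs := []rune(line)
	length = len(rs)
	if length == 0 {
		return true, 0, 0 // treat empty lines as all-space
	}
	spaces := 0
	for _, r := range rs {
		if r == ' ' || r == '\t' {
			spaces++
		}
	}
	return spaces == length, float64(spaces) / float64(length), length
}
```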
And the last one is the one that I think is interesting: we also looked at the GitHub issues associated with the project. And this one actually turned out to give us one of our bigger performance boosts. If a stack trace ended up touching the lines that were being changed, there's a higher likelihood that that is sort of complex or confusing code that people will want to ask questions about. And so that ended up being a pretty good feature, and there are some ways and ideas on how we can extend it from there. Yeah, so we did some hyper-parameter tuning. And for the most part, more trees was approximately better, and then we hit diminishing returns pretty quickly, around like 20. The other things we did didn't really make all that much difference. And before anyone gets worried, and you should not be worried that our performance numbers are juiced, based on how incredibly bad they are, I did save a test set before I started doing my hyper-parameter tuning. So it is valid and legit. Okay, so yeah, this is the slide of sadness. A good score here would be two orders of magnitude higher. So that was not great. But the dataset that we have is super imbalanced. It turns out that people really, on a percentage basis, do not leave a lot of comments on PRs. And so the random guessing score would be like 0.03. So 0.09 is three times better than just flipping a coin and being like, cool, looks like a great place to leave a PR comment. And I think we can do better with more data, and we'll talk about that and how you can help us make this model less janky. Yeah, and so this is the list of things, in the training area specifically, that we wanted to try and improve. One of them is that the classes are super imbalanced, and so that was really rough. And so we really want some more data of people interacting with Frank, to say whether or not Frank is making good predictions.
Lexers and tokenization: we could do much smarter things than what we were doing there. We don't have a lot of context. We look at things on a line-by-line basis, and context matters, so we could do smarter things there. And also, I think, bringing in more sources: like, if a piece of code is really confusing and frequently referenced on Stack Overflow, that's probably another good signal that would be useful. And we can explore some different models. I don't know, we could use deep learning if we wanted to, but I don't know. We have a link at the bottom if this is a thing that you care about; those in the back cannot see it, sorry. But the slides will be posted later. We'll tweet them, and you can fill out that link and you can submit PRs. Frank will ask you for feedback. Cool, so a little bit about building the GitHub app, which we're hoping gets approved, but we'll see how it goes. We had to go through and effectively demonstrate that we did have a working public endpoint that was TLS encrypted and that it actually did something behind the scenes. So that was exciting and kind of fun, to go and get to play with the GitHub API and actually make it so that we had an interactive demo for today. And then I think Holden here wanted to shout out. Yeah, if anyone is watching that works at GitHub, please approve Frank the Unicorn. Frank dash the dash unicorn. Yes, good point, yeah. It is totally not gonna steal people's credentials, I promise. We are good people. Yeah. Cool, so if you wanna find the source code for any of the programs we wrote, the Scala, the Spark, the Go, the gRPC, that is all in github.com/frank-the-unicorn/predict-pr-comments, and all of the Kubernetes side of things is there as well. So if you wanna see examples of how to run all of the pipeline in Kubernetes, that's all there, and you can try Frank out here, and then we can show you the whole system up and running.
We have it working right now live, and we'll let folks see what they want, or if they have questions or whatever, we can do a demo. Demo, demo. Okay, cool. Let's hope the Wi-Fi's still working. Yeah, I think first, what do we, do we wanna look at the answers or do we just wanna show folks behind the scenes first? Yeah, let's do the kube logs. Okay, cool. I'm gonna put this down for a second. I can talk about what you're doing. Okay, cool, yeah. So Nova is gonna bring up the logs for the different components, and there's some debugging information that we output about the PRs that are being sent to Frank, and also the features and what Frank's predictions were for those features. And I guess Nova really likes aliases, because typing is too much work. So yeah, we have three pods. There's the model server, the PR Suggester, and there's Ubuntu, which was just used to debug it, because they weren't talking to each other this morning. You can see we were working on this eight hours ago. I have not slept a lot. Okay, yeah. So the model server is the one which has the more interesting pieces so far. There is the file name, and this is a signal processing file someone submitted, and these are the lines that Frank was curious about. I fucked up this, oh damn, whatever. I made a slight mistake in this logging message, so we just print out that it's a list, which is not useful to anyone, but it's a very nice decorative list. I think it really brings the log messages together, so I'm not taking it out. Do you wanna show people the features and stuff, or do you wanna talk about this? Yeah, go for it, I can see them up. Oh yeah, cool, okay. So, whoa, okay, cool. Whatever, yeah. So you can see some of the feature vectors; they're all mostly chopped off just because it's in show mode and the feature vectors are pretty long, right here. Cool.
Yeah, and so I think the last one is predictions, so this one here is set to one zero, and it looks like it really didn't have much going on on that line, and so that's a thing, I don't know, whatever. It's some log messages, and we could go look at that PR, maybe, and that'll be more informative. We should probably open one up and watch it go live. Or someone in the audience could open one up and watch it go live, yeah? Yeah. No? Yeah, yeah, thank you, thank you. Okay, so the other side of things, as we split the window here while Francesca opens up a PR for us, is we can do our alias again. So I'm lazy. Does the alias actually save you any time if you have to type it in at the start of each session? Yeah, but now I know that this is like always and I don't think about it. So, tailing the logs for the Go program here; as soon as Francesca opens up a PR, we should see the Go program. Yeah, you wanna count down and hit Go? Count down. Cool, so yeah, there it goes. So yeah, what happened was GitHub sent us an HTTPS request, you see that over on the right. It says we received, oh thanks, a PR called hellofriends.go, and then you can see the Scala and Spark in the back end here processing the request, and we're already done, so Frank's already left some comments on your PR here. So let's go look at this. Oh, 28 pull requests. Wow, we have a lot, okay. Man, we have friends. Yeah, so this is always the exciting part, to see what Frank decided to comment on. And this is actually a lot of fun. So it looks like Frank, for whatever reason, thought package main did not look very good. Empty line on line two, line three, and line six. So I think this is a representation of: if we have a small sample size, it's gonna find more in there. Yeah, so one of the hacks that I did to make it slightly less garbage is that it more or less pulls the top K worst lines. And so when you submit four lines, it's like, well, of my top five worst lines, these four lines are it.
That's not exactly what it's doing, but it's pretty close. And so for small PRs, that optimization was a bad idea. But for some of the bigger PRs, it's a little bit better, because otherwise Frank was kind of unpredictable with how many comments Frank was leaving. Do you want to pull up a bigger one? Yeah, let's go look at a bigger one and hope it doesn't prove me wrong. Cool. Well, this one's got the signal processing module. That sounds like it's probably not five lines, unless signal processing has changed substantially. Okay, so, okay, cool. Okay, so Frank really doesn't like blank lines. Which is not the silliest thing for Frank to be upset about, because I think there are a lot of situations, especially in Python code, where people have extra blank lines. You leave comments and you ask them to take them out. And the problem is that we don't have a per-language model. And so Frank is just like, yeah, here's a blank line. And it doesn't have the context of the previous line. So it doesn't know, like, there have been X blank lines before this. All it knows is that blank lines are a little suspicious and end up having comments on them. So that was a depressing discovery. Okay, so Frank doesn't like how this comment was closed. That's probably a style matter, but this means that, on average, probably across GitHub, people don't like that style. They prefer closing their C-style comments differently. And so that's why Frank is upset. Stars close together. Yeah. Okay. Oh, right. And here's an else, I guess. No curly braces. Yeah, Frank gets upset with elses without curly braces. To be fair, this happens a lot in the Spark project. We make people use curly braces. And I imagine there are a lot of other projects which also make people use curly braces. And so Frank has just learned this de facto style from the aggregate of GitHub. Okay, Frank doesn't like incrementing size.
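The top-K hack Holden describes can be sketched as a simple sort-and-truncate over the model's per-line scores. The struct and function names here are made up for illustration; as noted, the real logic is close to but not exactly this.

```go
package main

import "sort"

// scoredLine pairs a diff line with the model's predicted score for how
// likely a reviewer is to comment on it.
type scoredLine struct {
	Line  int
	Score float64
}

// topKWorst keeps only the K lines the model is most suspicious of, instead
// of thresholding raw predictions. That keeps the comment count predictable
// on big PRs, at the cost of being over-eager on tiny ones.
func topKWorst(lines []scoredLine, k int) []scoredLine {
	sorted := append([]scoredLine(nil), lines...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```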
Well, to be fair, a lot of folks, especially in Kubernetes, like to stay away from the plus-plus convention, and they actually like to do the literal arithmetic. So there might be some learning there as well. Yeah. So essentially what this shows is that Frank learns a lot of things, and most of them might not apply to you. And so I think one of the challenges is that perhaps training it across all of GitHub didn't give the greatest results, and since there are individual projects that are large enough to have sufficient training data, doing a per-project model could also be kind of cool. Yeah, okay, Frank is just being, anything in, Frank? Okay, cool. Cool. Do you want to pick another PR? Sure. We've got five minutes. Does anyone have questions? Does anybody want to see anything else? Okay, cool. Oh, there's some great Scala code. Oh yeah, let's look at Scala. Where's that at? Oh yeah, a great piece of Scala code. Okay, so Frank came in and left some comments. That's a good sign. Large diffs are not rendered by default. Nice. I thought he did, sorry. It's loading, conference Wi-Fi. Okay. Interesting. Class B. I'm always fascinated by what he chooses to comment on. Like, what do you think he's saying? We have another one here. Oh, I don't think this would be described as idiomatic Scala. Yeah. This looks more like C. Yeah, I don't. Honestly, in this one we probably should have taken the limit off of Frank and just allowed Frank to comment on every line. Yeah. And be like, what are you doing? Oh, interesting. You're getting the same comments on the same line. Oh yeah. This file really messed with Frank. This is a good file. Questions? Yes. So the question is, when is this better than static code analysis? So I mean, I think that the set of things that Frank has learned are not all that much better than static code analysis. Yeah. It's interesting. And I think you could take the same thing and apply it to a specific project or a group of projects.
And then it could perhaps learn; we've seen that it's able to extract elements of common style and ask for that. And those things might not be easily captured by the static analysis tools. I know that, for example, the Scala static analysis tools are special. And so it probably depends on your language and what the other tooling is that's available there. Also, I think the ability to plug in other features, like the issue data, or data from mailing lists, or data from outside of the repository, and pipe that back into the repository, would offer a little bit more as well. Oh yeah, right. That's a really good point. And this, unfortunately, the demo isn't able to show the one feature that Frank did really well at, because we don't have the issues associated with the pull requests that people are making. But in its train/test/validate runs, it did much better at finding situations where the code was kind of sketchy, based on users' interactions with it and their reported problems, for sure, yeah. Question. You put them at zero one, and it correctly plugged in. So I mean, I put them at zero. Yay, oh my God. So if there's one where it's working, like, yeah. So, while it is like one model across every language, there is a feature which represents what language it's written in, and since it's GBTs, that's often probably close to a root node. So to some degree, we have per-language models, but not really. And I think if we trained directly per-language models, we would indeed get better results, oh my God. But what was your pull request number? Oh, pull request 16, yay. Let's look at the one where it worked, maybe, sort of. Is this the right PR? Yeah. Okay. Oh. Do you know where it is? We can just search for Frank. Okay, was this one? Okay, cool. It's very common. That's exciting. That's cool. Cool. I'm glad it worked once, maybe. Yeah, yeah, yeah. Three times better than guessing, but guessing is not very good. How much does it cost monthly?
So we're running everything in Kubernetes. We both work for cloud providers, so we're both kind of spoiled. Yeah, everything is free. Yeah. But I mean, realistically, I think the most expensive component here would be getting the Spark side of things up and running. Yeah, so the model training part is expensive, relatively speaking. If you wanted to train it for just your project, you could probably do that really cheaply. Or if there's a family of projects which are similar to the things you care about, for example, if you trained it on all ASF Java projects, you could probably train that very inexpensively if cost is a concern. And serving it would be, I don't know, the cost of one node, or maybe two nodes. Yeah. As far as the data processing and the GitHub endpoint, that's almost nothing. It's a very lightweight server with a few hundred lines of code. Question? Sure, yeah. The question is how we represent the code and if it's all represented as a single vector. And yeah, so we use the Word2Vec embeddings. We do cap the length of the input that we consider on a given line. Normally lines aren't huge. And if they are, that is in and of itself another feature; line length is another thing which was a strong predictor for PR comments anyways. When things start to get out of the scope where doing that is reasonable, we end up just commenting anyways. And so we have one vector representation. It's not great. I was hoping it would perform a bit better. I think we probably need different lexers in front of it. I think we need per-language lexers, and then we could probably get better representations. But for now it's okay. No, that means Frank. Oh, so the question is, if I didn't get any comments, does that mean Frank is satisfied with my pull request? And the answer is, it probably means Frank crashed. Which PR? Oh, okay. Oh, I think there's probably... Trixie, Trixie. So no, so this did not drop the database.
But it actually, I think it might ignore gitignore files. I should take a look, but I think it ignores pure.files right now. Because I think I just, yeah, I'd have to double check. All right, okay, cool. Are we out of time? Awesome, so thank you. Thank you all for listening. If you do wanna give us feedback by thumbs up or thumbs down on the pull request comments, that would be greatly appreciated and we'll use it to train a less shitty model.