And our next talk will be about one of today's most important buzzwords, machine learning, and how this project helped the web compatibility team, which has to triage around 1,000 issues from volunteers about the open web. The speaker, John Giannelos, is part of Open Innovation and ran this project together with the web compatibility team at Mozilla. Please welcome John.

So hi, thank you for having me here. My name is John Giannelos, I'm working in the Open Innovation team as a software engineer, and I'm here to talk to you about web compatibility and machine learning. Here's a rough outline. So, Open Innovation: my team is trying to bring innovation in an open way across the org. We experiment with other teams and ideas, prototype things that might or might not work, iterate and try to improve things in the open, and also keep track of the value this brings back. One of the teams we worked with is the web compatibility team. The team's initiative is to tackle web compatibility issues on the web, which means we're trying to reduce the cases where websites render or behave differently in different browsers, or where apps, projects, or websites don't work at all in some browsers. Part of that is gathering feedback from users, triaging that input, and providing feedback back to the browser vendors. Here's what a very basic reporting workflow looks like. Let's say we're on example.com and there's a compatibility issue. The user clicks on "Report Site Issue" and fills in a form, and after that it goes through our process and generates an issue on GitHub. The issue usually has a basic title indicating the website and the generic type of problem.
If you dive in, there are more details on how to reproduce the problem, which browsers or devices the user has tried, and even some trace logs from the browser. As you can see, part of the triaging process is applying labels: people do manual work to figure out what's wrong, whether something is a false positive, and whether it's valuable or very valuable. For example, if Wikipedia is broken in a browser, that's probably much more important than some very tiny local website. So this is where most of the triaging happens: on GitHub issues, using labels and milestones.

A disclaimer: I'm not any sort of data science or machine learning expert, but we tried to see what innovation looks like in that direction. Some context: we work closely with the WebCompat team, and the idea is based on Mozilla's bugbug, the Firefox release engineering machine learning initiative, which introduced machine learning principles and concepts into their triaging process. The problem statement we came up with is that the WebCompat project's reporting cadence is very fast. Many people submit web compatibility reports; Firefox reaches millions of users, and this button is accessible from all sorts of different websites by all sorts of different users. So people submit a lot of feedback, but only a tiny fraction of it ends up being valuable to any browser vendor. This leads us to another problem statement: the WebCompat reporting signal-to-noise ratio is too small. We have a lot of spam and a lot of abusive content. People don't necessarily understand what web compatibility means, so they might submit things like "I have a virus" or "I have spyware" and so on.
Only a small fraction of what people submit ends up being valuable. One thing we realized while trying to figure out how to improve the project is that we have a lot of historic data: around 50K report entries, and on top of that, all the events around those reports. We have roughly five times more events, recording when something was labeled, when it got a milestone, who got the issue assigned, and so on. So we thought it might be a good idea to train a model on the WebCompat data to improve the triaging process. Then reality hit, and we realized we're not machine learning experts; we barely knew what these things look like. But everything was going to be fine, it was going to work out.

Here's what an example data point looks like: we have the title of the issue and the content of the issue in free form, and everything lives on GitHub. First steps: we figured out that data living on GitHub is not very future-proof, because we ran into all sorts of issues like API throttling and failures. Sometimes we couldn't get the data at all, or there was so much of it that, given the GitHub API rate limits, it was barely feasible to use GitHub as a data store. So we decided to use another data store, and Elasticsearch plus Kibana turned out to be a good fit. What we did was take all the WebCompat issues from the API in JSON format, feed everything into Elasticsearch, and then use Kibana for analytics. That was very useful, because we had never had analytics around WebCompat reports and WebCompat data before: everything used to live on GitHub, issues were in a silo, and we didn't have any metadata or analytics around them.
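To make that ingestion step concrete, here is a minimal sketch, not the actual webcompat-ml code, of turning GitHub issue JSON into Elasticsearch bulk actions. The index name and the selected fields are assumptions for illustration.

```python
# Hypothetical sketch: map GitHub issue JSON (as returned by the issues
# API) to Elasticsearch bulk actions. Index name and fields are invented.
def issues_to_bulk_actions(issues, index="webcompat-issues"):
    for issue in issues:
        yield {
            "_index": index,
            "_id": issue["number"],  # use the issue number as document id
            "_source": {
                "title": issue["title"],
                "body": issue.get("body", ""),
                "labels": [label["name"] for label in issue.get("labels", [])],
                "created_at": issue["created_at"],
            },
        }

example = {
    "number": 42,
    "title": "example.com - site is not usable",
    "body": "Layout differs between Firefox and Chrome",
    "labels": [{"name": "browser-firefox"}],
    "created_at": "2019-11-01T10:00:00Z",
}
actions = list(issues_to_bulk_actions([example]))
print(actions[0]["_id"])  # prints 42

# With a running cluster you would then index the documents, e.g.:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
```

Once the documents are indexed, Kibana can build dashboards directly on top of these fields, which is the analytics piece the talk describes.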
After that, once we had figured out what we needed to achieve and what the data looks like, we did some research and found there's a big ecosystem of automatic machine learning tools. Two of the most popular ones: Ludwig from Uber, which got a bit of hype because it's from Uber, is built on TensorFlow and builds deep learning models based on just a CSV. The other one is AutoLGS, which does pretty much the same thing in a scrappier way. We found that even with basic tooling and basic machine learning models we got decent results, and the basic accuracy we got was probably good enough.

So we needed to figure out: what is our data? One of the most important things we dealt with was coming up with a proper dataset that actually fit the purpose of our project and our problem statement. We learned that even if you have tons of data structured in a way the tooling can read, and even though it feels like you have something valuable, you can still end up with nothing, because it doesn't work. We felt there was some correlation, but things didn't work. The weirdest example is that we trained a model on the dataset we had and it always produced the right results, which is suspicious by default. It turned out the data were kind of lying: they were biased. So one of the most important lessons is that we worked closely with the teams that provide the data and that do the actual day-to-day triaging, just to figure out how to deal with the problem.
After a couple of failed attempts whose results were very biased, very off, and very suspicious, we realized that the value was not in the raw dataset but in the events around the data. What did the trick in our case was going through all the historic data, finding all the events, and seeing how triaging translates into actions from the users. Based on that, we defined what a good or a bad report is, and from that definition we built the dataset.

After that we concluded: we have a decent dataset and a model that works, but we need something that works in production, not just a script based on automatic tooling that we barely understand and that we only have some indication works well. So we looked at what the ecosystem offers and came up with a basic set of tools. Python is clearly leading in the machine learning world; pandas is very good for data frames and data handling; and scikit-learn, even though it's more researchy and educational, did the trick. By looking at the actual tooling around the machine learning ecosystem, we came up with this basic stack. And we found that things are very simple: even though we were expecting a complicated code base and external experts to tell us whether things were fine, we ended up with something very basic. All we did was prepare a proper basic dataset, feed it to our model, and it produced good results. Even under the hood it's not that complicated. What we're doing is a pretty naive approach: we concatenate all the free-text data we have, tokenize everything, and pass it to a well-known off-the-shelf classifier, XGBoost, which does gradient boosting. And it provided amazing results.
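The approach just described, concatenate the free text, tokenize, and feed a gradient-boosted classifier, can be sketched roughly like this. This is not the webcompat-ml implementation: the toy reports and labels are invented, and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
# Rough sketch of the pipeline described above (toy data, invented labels).
# scikit-learn's GradientBoostingClassifier stands in for XGBoost here.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

reports = [  # title + body concatenated into one free-text field
    "example.com - site is broken. layout falls apart on Firefox mobile",
    "video.example - video doesn't play. works in Chrome, fails in Firefox",
    "help - i think my computer has a virus",
    "free crypto!!! click here now",
]
labels = [1, 1, 0, 0]  # 1 = valid compatibility report, 0 = noise/spam

model = make_pipeline(
    TfidfVectorizer(),  # tokenize and weight terms
    GradientBoostingClassifier(n_estimators=20, random_state=0),
)
model.fit(reports, labels)
pred = model.predict(["news.example - page renders differently in firefox"])
```

The whole "model" is a two-step pipeline, which matches the talk's point that the code can stay very basic once the dataset is right.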
Even after that, we kept challenging the whole idea of metrics and what success looks like. Metrics came into the game, and we tried to figure out which metrics we needed to track to make sure we were doing well. There are many metrics you probably know as a machine learning developer or data scientist, and they can be confusing and complicated, but in the end what matters is having a basic understanding of the different metrics, knowing what your dataset and your results look like, and being consistent about what you track. So: understand the problem you're solving and what the metrics mean for it. We came up with these results and everyone was very happy: with basic tooling and basic methods we got 90% accuracy, which is not bad, right?

Our current stack is mostly Python-based. We have a project called webcompat-ml, which is the Python package with all the machine learning work we're doing; it's based on XGBoost. We release Docker images for automation, and we use GitHub events to orchestrate the whole flow. This is our pipeline: every time there's a new GitHub issue, every time a user reports something, we trigger the automation and send the payload to a simple HTTP endpoint. From there we spin off a machine learning task based on Docker. That produces results, which we feed back into our data storage for analytics, and then, if the results clear a basic confidence threshold, we post back to GitHub: we either close the issue, write a comment, or add a label saying this issue doesn't look valuable, or this issue looks very good and we should go for it.
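Accuracy alone can be misleading when most reports are noise, so the kind of metric sanity check described above looks roughly like this; the true and predicted labels below are invented for illustration.

```python
# Invented labels: 1 = valuable report, 0 = noise. With an imbalanced
# stream, check precision/recall for the class you care about, not just
# overall accuracy.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)    # 0.8 overall
prec = precision_score(y_true, y_pred)  # of the flagged reports, how many were valuable
rec = recall_score(y_true, y_pred)      # of the valuable reports, how many were caught
print(acc, prec, rec)
```

Here both precision and recall are 2/3 even though accuracy is 0.8, which is exactly why the talk stresses understanding what each metric means for your problem.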
I think the most important takeaway, at least the message I'm trying to convey here, is that the open source ecosystem is very big right now and machine learning is becoming a commodity. It's no longer the case, as in the past, that you need very deep research work and very highly skilled specialists to do this; the tooling we have now and the open source ecosystem around machine learning and data science are very approachable. What I'm trying to say is: if you have a basic problem statement that you understand, and you have data to back that problem statement, then quick hacks can bring a lot of value. In our case, a quick hack like this, a very basic NLP machine learning model, gave 90% accuracy on our input, which means that for, say, 50K reports, we can have a signal that some 10% is worth keeping and the other 90% we can skip. In the end that opens up more opportunities for the project, because if you have a way to take the input and throttle what reaches the pipeline that does the actual work, you can open it up to more people. Right now we're targeting a specific Firefox release; what if we targeted all Firefox releases, or people outside Firefox contributing to the pipeline? So yes, quick hacks bring a lot of value. We saw this in real life in our project: we have results showing that what we did actually saves time and effort for the people doing the manual work, without invading their workflow. It was a quick, contained, easy experiment that turned out well.
So my moment of wisdom here: I highly encourage people to try this kind of experiment, at least to get comfortable with the tooling and with the idea of introducing machine learning to your project, especially in the open source world, where we have open processes, issue management in the open, and user feedback, and especially in big projects that can't easily be triaged by a few people. There's a lot of buzz around machine learning, and there are companies and projects that are very into it and promote deep learning and more complicated approaches, but in the end even basic tooling works. We tried a support vector machine, the most basic notion of a classifier for this type of problem, and it already gave amazing results. So try basic tooling, try XGBoost, which is pretty much the industry standard for this kind of problem, and look at the results; in the end, that may be all you need. This is the most useful thing I've seen in this machine learning journey we had: it pretty much guides you through the problems you need to solve and what kind of tooling you can use. So yeah, I highly encourage you to introduce machine learning to your project, see how things work, try to be a bit more innovative without breaking your whole workflow, and validate the results. That's it. Thank you so much.

Questions now? Don't be shy. Yes, I'm gonna run around with the mic; someone will need to help me.

Hi, thanks for the talk. What's your expectation for issues that are incorrectly classified? Sorry, can you repeat that? Valid issues that are incorrectly classified as spam or as not being of good quality: do you expect them to be re-raised by people, or how do you handle those?
Yeah, so the people doing the manual triaging know about the process, and they flag things that are classified incorrectly. Every time we retrain the model, the false positives and false negatives are fed back into the training data. We just keep that as part of the workflow.

Hello, thank you for your talk. You said you use the feedback to improve your model. Do you have any statistics on how that influenced the quality of the model over time?

We don't have any statistics right now, because everything is very new; it went into production in late November. So we don't have many statistics or metrics yet on how it improved over time. Right now all we know is that the people doing the manual work are also tasked with giving feedback about how the process works, and if we have a false positive, they let us know and we train everything again. But no metrics yet.

Can I ask one more? Did you get feedback from the people that work with your system, with those predictions, on whether it has affected their productivity?

Yeah, that was very interesting, because one of the things we identified early on is that working in a silo, on your own, is very bad in this kind of work. We built a basic model, we were very happy that it produced results, and then we started running it in a more experimental way in the actual workflow, and people didn't really like it, because they didn't trust it. So we asked for feedback from the people doing the manual triaging, and one of the most important things that came up was: it looks okay, it gives results, but we don't trust it; what is it doing?
So part of it was writing documentation and examples of how it works, explaining the metrics, giving a TL;DR of machine learning metrics, and explaining what kind of classifier we use and what a confidence threshold is. After that, people were more familiar with it and they were very happy; the triagers say they're glad something is cleaning up the pipeline so they can focus on the more important stuff. Something that relates to the previous question: we found the confidence thresholds very important in this project. Given the accuracy we have, and given that the metrics are high for the class we care about, which is the bugs that don't need diagnosis, the accuracy is high enough that the confidence threshold can be set very high. If you tell people there's a 60% chance the result is right, they don't like it; but if the model says it's 99% correct, then people trust it. So we introduced the idea of classifications and confidence labels: low, high, and very high confidence.

Yes? I'm gonna need some help to pass the mic.

Nice presentation. Regarding the data points you showed, I noticed the title of the issue was in there as well, so something written in natural language. How did you handle that, using NLP methods or something? And the second part: for titles of reports in different languages, did you filter those out, or did you just focus on English?

So first of all, most of the content we have is in English, so that hasn't been an issue so far. As for processing the content, we used NLP methods: TF-IDF and count vectorizers, and they gave good results.
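The low, high, and very high confidence labels could be derived from the model's predicted probability along these lines; the cutoffs and label names here are invented, not the project's actual values.

```python
def confidence_label(prob):
    """Bucket a predicted probability into a human-facing confidence label.

    The cutoffs below are illustrative, not the project's real thresholds.
    """
    if prob >= 0.99:
        return "confidence-very-high"
    if prob >= 0.90:
        return "confidence-high"
    return "confidence-low"

print(confidence_label(0.995))  # prints confidence-very-high
print(confidence_label(0.60))   # prints confidence-low
```

Surfacing a bucket instead of a raw probability matches the point above: triagers trusted "99% correct, very high confidence" far more than a bare number like 0.6.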
One of the things we might try in the future is introducing more performant NLP libraries like spaCy. But even with the tools from scikit-learn and the basic text feature extraction methods, the results are very good, so my approach is to not complicate things. TF-IDF, yes.

I want to address the second part, about languages. The tool presented at webcompat.com gives you suggestions when you enter a bug: even for a website in a different language, it will suggest options like "site is broken", "mobile version doesn't work", "there are glitches", so it's easy to just pick one. Even if you add a description in another language, the issue can be identified from the suggestion you chose when you reported it.

Yeah, and as part of the general improvement of this project, one of the things my team also did, which is outside this scope, is trying to improve the reporting process. Right now we have a free-form report written in Markdown with some data in it, and what we want to move towards is having structured fields: for example, what's the browser, what's the description, what are the steps to reproduce. Because right now it's completely free text. Any other questions? Thank you so much.