So, thank you so much for having me over and for coming to this remote room. I'm really happy to be able to share something and hopefully teach you something here as well. My name is Maciej and I work as a data scientist at Rainforest, where we use lots of crowdsourcing to make our product work.

Before we start, I'm really interested to hear whether any of you have used something like Mechanical Turk or CrowdFlower to get data from people. If so, I'd love to hear what you did and what kind of problems you had with it.

We used it to go through transcripts of interviews that were done with people regarding disease outbreaks. We used it to tag certain keywords in there.

Okay. And did you get good results?

We had a ton of people use it and the results we got were spectacular, but we didn't have a big sample set.

So, the biggest problem for us is related to that: it's difficult to get reliable quality when you're asking people to do something. There are different ways of doing crowdsourcing; the one we use is like Mechanical Turk, where we pay people a few pennies for every small task they do. And there is a bunch of trade-offs and things you have to think about when you do this.

There are a few parts to this talk. I will tell you about the kind of data that we gather, the way we model it, and how we use it to improve our product. There is a bit of technical material about the machine learning that we did. But the part we only realized recently is that we were missing quite a bit of empathy for our testers, the people who work for us, and how important that is at the end of the day.

The first part is the data, and for that I have to tell you a little bit about what Rainforest does. I promise not to sell you too much on it. Basically, we provide QA as a service. If a company is building up a QA team, we try to replace that team, so that they can just use our service instead of building a team. We use humans to find problems in web applications.

When a customer wants to test their application, they just write tests in plain English. They say: the tester should go to this URL, do something, and then they ask a question. Is the page okay? Does the logo show up in the right place? And so on. That's an example of one test with a single step: go to the URL, look at the page, does the image look okay?

And then there is the corresponding thing that workers on the microtask platform do. They get the instructions, read them, and answer the question by following the instructions inside a virtual machine. Below the instructions they see a VM running the right browser version, which allows the Mechanical Turk or CrowdFlower workers to follow the instructions and check whether the website actually works.

So, what kind of problems might there be? First of all, testers might not be able to understand the instructions. They're not all native English speakers, and not all the tests are written clearly, so that's something we have to overcome. Even if they understand the question, they might not know what to do in a specific scenario. There's lots of ambiguity: is this a bug? How precise is someone supposed to be? And so on.
And finally, even if they do understand everything, because we're paying them they might still want to cheat and click "yes, yes, yes, everything is fine" as fast as possible, just to get a bit of money. So that's also something we have to think about.

What kind of solutions can we use to deal with these things? First of all, training, to make sure people know what to do. We can set up the incentives carefully. We can track people who have been with us for a while and have a high reputation, and trust them more than others. We can have multiple people do the same thing to get some redundancy. And finally, the technical focus of this talk: an idea based on a paper from 2011, where the authors figured out whether people were actually doing what they were supposed to do by looking at input patterns, gathering data and training a classifier. And this is what we have done as well.

The way we capture the data is using a virtual machine. Everything that happens in the VM you've seen is recorded: the mouse movements, keyboard inputs and network requests from the browser. And this is how it looks if you also capture screenshots: if a tester sees a login page and they click on username and password and paste things in, we capture that, so we know what they did.

Knowing that, our data is basically a collection of feature vectors describing the work people did, each with a label saying "this is good work" or "this is lazy work that we should not take into account". Feel free to shout out if you want to creatively think up features that we can create from that. What we have is all these mouse movements that you've seen and what people clicked, and we need to turn it into a feature vector, a string of numbers. Feel free to shout out, but I can also show you what we've actually done: we count the number of clicks, compute the speed and acceleration of the mouse cursor, count key presses, and check for specific key combinations like copy and paste. You can also look for keywords in the test instructions: for example, if the instructions say "click", there should probably be at least one click detected, and so on. It's fun to experiment with these things and try to think up new combinations that might be informative, and then use Python and the scikit-learn package, which is pretty awesome and makes it really easy to look at things like feature importance and iterate on your pipeline. (A rough sketch of this kind of feature extraction is included below.)

The second part is labeling the data. We capture everything, but we don't know out of the box whether a given piece of work is good or not, and that needs hand labeling. What we do is send a weekly email to developers with a few links, asking them to tell us: is this good work or is this bad work? The interesting part is how to select which data to label. We have tens or hundreds of thousands of data points every week and only about 20 developers, and asking them to label even 10 links per week is already stretching it. So we can only have information about a tiny fraction of our data. At first we just had a completely random selection of things to label. Then, once we got enough data to train a first version of the classifier and verified that it was working well enough, we split the selection in half: 50% was random, and for the other 50% we chose the worst offenders, the laziest-looking work, to check whether the classifier was working correctly or not.
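To make the feature idea concrete, here is a minimal sketch of the kind of feature extraction described above. The event format, field names and exact feature set are assumptions for illustration, not Rainforest's actual schema.

```python
import numpy as np

def extract_features(events, instructions):
    """Turn a list of recorded input events into a fixed-length feature vector.

    `events` is assumed to be dicts such as {"type": "click", "t": 1.2, "x": 10, "y": 20},
    {"type": "move", "t": 1.3, "x": 15, "y": 22} or {"type": "key", "t": 1.5, "key": "v", "ctrl": True};
    `instructions` is the plain-English test text. Both formats are assumptions for this sketch.
    """
    clicks = [e for e in events if e["type"] == "click"]
    keys = [e for e in events if e["type"] == "key"]
    moves = [e for e in events if e["type"] == "move"]

    # Mouse speed: distance between consecutive positions divided by elapsed time.
    speeds = []
    for prev, cur in zip(moves, moves[1:]):
        dt = cur["t"] - prev["t"]
        if dt > 0:
            speeds.append(np.hypot(cur["x"] - prev["x"], cur["y"] - prev["y"]) / dt)

    features = [
        len(clicks),                          # number of clicks
        len(keys),                            # number of key presses
        np.mean(speeds) if speeds else 0.0,   # average mouse speed
        np.std(speeds) if speeds else 0.0,    # rough proxy for how much the speed varies
        sum(1 for k in keys if k.get("ctrl") and k["key"] in ("c", "v")),   # copy/paste combos
        int("click" in instructions.lower() and len(clicks) == 0),          # "click" requested but none seen
    ]
    return np.array(features, dtype=float)
```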
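And here is a rough sketch of that 50/50 selection of work to send for hand labeling: half random, half "worst offenders" according to the current classifier. The function name, the even split and the assumption that class 1 means "good work" are all illustrative choices, not the actual implementation.

```python
import numpy as np

def select_for_labeling(X, ids, model, budget=10, rng=None):
    """Pick `budget` pieces of work to hand-label: half at random, half with the
    lowest predicted probability of being good work."""
    rng = rng or np.random.default_rng()
    n_random = budget // 2

    # Random half: an unbiased sample, useful for monitoring overall system health.
    random_idx = rng.choice(len(ids), size=n_random, replace=False)

    # Worst-offender half: the items the current model is most suspicious of.
    p_good = model.predict_proba(X)[:, 1]            # column 1 = class 1, assumed to mean "good work"
    worst_idx = np.argsort(p_good)[: budget - n_random]

    chosen = set(random_idx) | set(worst_idx)
    return [ids[i] for i in chosen]
```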
But now we're actually moving back to random, because there are really two purposes for labeling. One is improving the performance of the classifier by gathering data, and for that it's fine to have some bias. The other purpose is to evaluate the health of your system overall, to see whether the changes you make are having a good effect or not. To do that you need a representative sample, and having a bias there, like we used to have, is really bad; it doesn't allow you to say anything informative.

There's also a small thing I want to mention. It gets pretty technical, but if you're doing classification and your data set is imbalanced, meaning you have many more samples from one class than from the other, you might run into problems. Different algorithms deal with this scenario differently, and there are methods for balancing your data set: you can throw out some data or you can generate new data, and there are different trade-offs. But first of all, always make sure you can measure how well your classifier actually works; make sure you have good metrics. We used to do data balancing, because the majority of the work in our system is actually good and only a small proportion is bad. But we discovered we don't actually need to explicitly balance the data; it's fine, as long as you can show that your classification is good enough. I'm not going to explain oversampling and undersampling in more detail here.

Moving on to the model: it's a random forest. If you're familiar with machine learning algorithms, you know random forests are really popular; they're easy to use and easy to understand. We use one for binary classification with the features I've shown you before. The remaining part is that a random forest can give you a classification, in our case a class, either a 1 or a 0, but it can also give you a probability. We use the probability together with a threshold, so we can adjust that threshold to the risk level we are comfortable with.

And finally: a while ago I did this exploration, I had my Jupyter notebook with all the code running, and I thought, okay, great, I solved the problem. Then some people I work with said, okay, sure, that looks good, let's use it for something. And I was like, okay, so how do I actually put this into production?

What we needed is something like this. We wanted a prediction service sitting on a server that communicates with our application. It receives a bunch of JSON, does the magical classification, and returns a bunch of JSON back saying whether this is good work or bad work. The application, of course, also has a database behind it. And this is all you need once you have your model trained: you do your analysis, train your model, save it somewhere, and your prediction service can use it.

For training the model you might be tempted to have the prediction service talk to the database directly, because otherwise you need to send lots of data through your application to the service. But that's generally not a good idea, because now you have two different things talking to the same database, and if you end up changing the schema you need to adjust two things at the same time. There are ways around it, and every time you have to consider the trade-offs. In our case we don't need to retrain too often; once a week or once a month is fine, so we can just do it manually in a manageable way: we access the data, or download it to a local machine or something like that, do the training, verify it, and then actually use it.
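As a concrete illustration of the modelling part, here is a minimal sketch of training a random forest on the extracted features and then classifying with an adjustable probability threshold instead of the default 0.5 cut-off. The synthetic data, the `class_weight` setting (shown as one simple way to deal with the imbalance mentioned above), the threshold value and the file name are all made up for the example.

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in data: in reality X comes from the feature extraction above and y from hand labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))             # 6 features per piece of work
y = (rng.random(500) > 0.1).astype(int)   # imbalanced: roughly 90% good work (1), 10% lazy (0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# class_weight="balanced" is one simple way to cope with the class imbalance.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# Always look at the metrics (and feature importances) before trusting the model.
print(classification_report(y_test, model.predict(X_test)))
print(model.feature_importances_)

# Use the predicted probability with an adjustable threshold rather than the default 0.5 cut-off.
THRESHOLD = 0.8                               # made-up value: tune to the risk level you accept
p_good = model.predict_proba(X_test)[:, 1]    # column 1 = probability of class 1 ("good work")
flag_for_review = p_good < THRESHOLD

joblib.dump(model, "work_classifier.joblib")  # saved artifact for the prediction service to load
```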
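And here is a minimal sketch of the kind of prediction service described above. Flask, the route name and the JSON shape are assumptions purely for illustration; the talk doesn't say which web framework or API Rainforest actually uses. The service loads the saved model once at start-up, accepts a JSON feature vector and returns a JSON verdict.

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("work_classifier.joblib")  # trained and saved offline, as sketched above
THRESHOLD = 0.8                                # same made-up risk threshold as in the training sketch

@app.route("/classify", methods=["POST"])
def classify():
    # Expect e.g. {"features": [3, 42, 110.5, 35.2, 1, 0]}; shape must match what the model was trained on.
    features = np.array(request.get_json()["features"], dtype=float).reshape(1, -1)
    p_good = model.predict_proba(features)[0, 1]   # column 1 = probability of class 1 ("good work")
    return jsonify({"p_good": float(p_good), "is_good_work": bool(p_good >= THRESHOLD)})

if __name__ == "__main__":
    app.run(port=5000)
```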
There are a bunch of other requirements as well. The most important one for us was being able to debug when something goes wrong, because machine learning is a slightly different way of operating than regular software development. There was a really interesting paper from Google with a title like "Machine Learning: The High-Interest Credit Card of Technical Debt". There are lots of pitfalls when you create this magical black box that does classification for you and you don't really know what's going on inside. So if you are using machine learning in production, you need to take care of saving enough information together with the model to be able not only to re-instantiate your objects, but also to reproduce them. I will talk about that in a second, but the rest of the application is pretty simple: it's just a web service that takes a bunch of data, gives it to the classifier, and returns the result as JSON. That's all there is to it.

The interesting part is how to load a trained model into memory to be able to use it. Are people familiar with Python here? If you are, you probably know what this is: a pickle. There is a Python module called pickle that allows you to take objects you have in Python and serialize them down to a file, so you can save them for later and read them back. But there are problems with it. First of all, depending on what you are serializing, there are different packages; for example, joblib is a package that is bundled with scikit-learn and is more efficient at serializing NumPy arrays, so it's great for scikit-learn estimators. But there are other issues too. Three years ago at PyCon there was an interesting talk about why you should never use pickle, and I encourage you to watch it if you're interested. The main problems are that you are making a bunch of non-explicit assumptions: when you serialize something, you implicitly depend on the versions of your packages, and unless you take extra care, you will run into problems in the future. It's also really insecure: deserializing a file that someone gives you basically amounts to remote code execution, so only deserialize objects that you are really sure are safe. If you look at the documentation of scikit-learn, all of these things are mentioned there.

We wanted to follow good practices here, so we created a small open-source Python package called Testimator. It wraps your scikit-learn estimator in a class together with a bunch of metadata that you give it, and then you can serialize it. It doesn't do anything to help you with security, but it makes it slightly easier to bundle and ship trained machine learning models. It has some limitations, and as far as I know we're the only people using it, so if there's something you would like it to have that would make it more useful for you, let me know; I'm happy to improve it.
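The package itself isn't shown in the talk, but the general idea of bundling an estimator with metadata before serializing it might look something like this sketch. The class name, fields and methods are assumptions for illustration, not the actual Testimator API.

```python
import json
import platform
from datetime import datetime, timezone

import joblib
import sklearn


class WrappedEstimator:
    """A trained estimator plus enough metadata to know, later, what it was and where it came from."""

    def __init__(self, estimator, feature_names, description=""):
        self.estimator = estimator
        self.metadata = {
            "description": description,
            "feature_names": feature_names,
            "trained_at": datetime.now(timezone.utc).isoformat(),
            "sklearn_version": sklearn.__version__,
            "python_version": platform.python_version(),
        }

    def save(self, path):
        joblib.dump(self, path)
        print(json.dumps(self.metadata, indent=2))  # handy to log alongside the saved artifact

    @staticmethod
    def load(path):
        # This is still pickle underneath: only load files you trust.
        return joblib.load(path)
```

A wrapper like this doesn't make deserialization safe, but at least when you open a months-old model file you can tell which package versions and which features it expects.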
But finally, the other part of this talk is about considering whether technical solutions are actually suitable for what you want to do. Because we deal with humans, we need to understand what they do and why.

If you ever need to do some work on Mechanical Turk, make sure you spend some time as a worker yourself, at least a day, or an hour a week over a period of time, going through the tasks and completing small things, to get to know the system and really see how workers behave and why. And give people ways to talk to you. This is pretty obvious in retrospect, but we used to treat our workers as a kind of nuisance: we have our customers, who are our partners, and the testers were just something we had to use, where our job was to catch the bad apples. But that's not the right way to think about it. We really have to understand them, talk to them, listen, and ask them questions.

There's one more thing, a short story I wanted to tell you. This is my dog; I got her a few months ago. She's a Border Collie called Whiskey. Border Collies are really smart dogs, but if they get bored they get destructive; they need something to occupy their minds. So when I got her, I started watching YouTube videos on how to train your dog, and it's really fun, I can recommend it. One thing that really stuck with me is this really cheesy notion of setting them up for success. If you have a small dog and you want to teach them, for example, to come to you when you call them, you cannot start by standing 20 metres away and just trying to get them to come over and understand. You have to start half a metre away and say, I'm here; they come, you give them a reward. You do it over and over again, then you take a step back and do it again, and you gradually increase the distance. At every point it's really obvious for the dog what to do, and the step to the next level is manageable. If you want them to jump from 0 to 1 immediately, they won't get it. I don't mean to imply that training dogs is the same as teaching people, but I think some of the same principles apply. I know that when I'm learning something, it's obviously much easier to learn in small gradual improvements rather than getting the whole concept at once. Oh, and this is Whiskey. Recently she learned to successfully tear up my backstations. But the same thing applies to your crowdsourcing: make sure you train people progressively and give them an easy path to do the right thing.

And this is basically it. The whole approach that we took has some limitations. Our data capture is not always reliable, so if we miss some clicks, which happens, we get a wrong classification; that's something we have to take into account. And it requires a lot of labeling effort. There are smarter things we could do: there's this thing called active learning, which looks at the unlabeled data and tries to select the points that, if you labeled them, should improve the classification the most. We could also do something more meta and try to get workers to verify other workers' work, but that comes with a bunch of problems we haven't yet found solutions for.

The main takeaway I want to impress upon you today is, first of all, to seriously evaluate whether a technical solution is the right thing to do at the time. We're pretty happy with where we are now, but if we had started thinking about empathy before we implemented all this, we would probably have gotten to where we are faster, and we would be happier.
Also, make sure you respect your community and the people you work with. Even if they're not your customers, they're still human beings, so make sure you understand their perspective. And at the end of the day, machine learning is still useful and a really cool thing to do, but it's not always, in every situation, the biggest bang for the buck. And with that, thank you very much.

I'm imagining that when you started this, at some point somebody proposed the hypothesis that machine learning could help you evaluate the quality of the product. How long did it take you, in your company, to decide whether this effort was actually useful or not? And what were the criteria to say: you know what, this is actually going somewhere, let's put more time and investment into it?

So, there wasn't really a formal process. We're still a fairly small company, and when I joined as the first data scientist there were, I think, ten of us. I had this idea and people said: yeah, sure, sounds good, we trust you, go and do it. And that was basically it. We were maybe a week or two into exploring it, plotting the data and trying to figure out whether there was something we could find: looking at a data point and the features we can compute, can we, as humans, figure out whether it's good work or not? It wasn't very disciplined, it was kind of qualitative, but that was the process we went through.

You're gathering data on mouse movements and keystrokes. Did you consider listening to the microphone, with their permission?

No, we haven't. First of all, we do have permission: they are people who work for us and we tell them what it involves. But we didn't consider a microphone. Why would that be useful?

Well, I guess sometimes that could be useful for testing, because someone is saying: damn it, fuck.

Right, that kind of interaction would be useful information. I think it would be a little bit challenging at the moment, but we could probably make it work; it might be something we should think of.

So, for each task that we have, we select at least three people to do the same thing and then we look for consensus. There are some problems there, because different people might get different results even if they're all doing exactly the same thing, even if they're testing the same service; sometimes there's a random error that occurs for one person and not for another. So we look for consensus, but if someone's work is rejected because it doesn't agree with the majority, we cannot really use that to penalize them, because it would be unfair. We are thinking of doing something smarter: looking not only at consensus on results, but also at consensus on how people go about completing the tests, which we're not doing yet. That would be interesting; I think it's still an active area of development for us. Thank you so much.