The talk will be in English, but the introduction will be in Catalan, just a few words, so don't stress about that. Well, today we have Xavier Amatriain with us and I'm very happy about it; I'm really excited. Lots of things excite me, but today I'm especially excited. He is going to talk to us about his experience implementing machine learning methods in production environments. He is vice president of engineering at Quora, a company based in Silicon Valley. Ah, sorry. He is the vice president of engineering at Quora, which is a company based there. I've been using it these past few days and I think it's pretty good. Not that this is free advertising, but I've looked up quite a few machine learning things on it and it's very good. He was quite well known before for his work at Netflix, and he's from here, from Barcelona. He studied here, I believe at the Politècnica, then did his PhD at Pompeu Fabra, and then went to Telefónica I+D. Well, at the very end we'll take questions, so if you have any, save them for the end and we can ask him then, okay? Thank you very much.
Thanks, Aleix. I'll give the talk in English for a couple of reasons: one is that I know there are people in the audience who don't speak Catalan, and the other is that I'm more used to speaking English when I talk about technical things, so it will also come out better. But bear with me for a second, because I want to say at least a couple of words in Catalan too. It's a pleasure to give this talk. It's a shame I couldn't be there in person, because I would have liked that. The truth is I only come to Barcelona once or twice a year and the dates didn't line up; I didn't know when the next time would be, so Aleix and I decided it would be better to do it by videoconference. But it really is a pleasure to speak Catalan from time to time. I do it at home every day with my family, with my kids, but I don't get to do it daily at work. We'll do the talk in English, but if you want, I'll keep about fifteen minutes at the end for questions and you can ask them in Catalan or in Spanish as well, no problem.
The first lesson is about this question of data versus models. What is more important: having more data or having better models? I don't know how familiar you are with this discussion, but it's a typical one that comes up all the time in conferences and talks, whether what you need is more data or better models, so I wanted to put it in perspective. Just going back to the Netflix Prize: this was 2006, when the Netflix Prize started. There was a pretty popular blog post by Anand Rajaraman; he's pretty well known, he teaches at Stanford, he was formerly a senior VP at Walmart and he's had a number of startups. At that point, teaching at Stanford, he told his students to actually try to win the one-million-dollar Netflix Prize, and they did a lot of experiments trying to improve the accuracy of the algorithm using machine learning. Then he wrote this blog post about how more data usually beats better algorithms, which was controversial and still is. The point he was trying to make is that you could do better by augmenting the data set that Netflix had released. For those of you who don't know, that data set was made up of just user ratings on movies: basically, user X watched this movie and rated it five stars or three stars. That's all there was in the data set. So some of his students augmented it by looking at IMDB and adding some metadata, and it turned out that they got better results than the students who didn't use the metadata.
So his argument was, really, we're wasting our time thinking about algorithms; we should just look at the data and add more data. Now, is that really true? Well, it turns out that it's not. As a matter of fact, years later, once the Netflix Prize had finished (which, by the way, was won by people who had not augmented the data in any way), there was another publication, I think it appeared at the ACM RecSys conference, where a couple of people who were not on the winning team, they were on the second team, showed that augmenting the data set with metadata was actually not helping at all, and that adding a few more ratings was better than using whatever metadata you could find. So no, it's not always the case that more data helps, and in this particular case we have a good counterexample. I'm not saying that it never helps to have more data, but the point is that sometimes it's really about having better algorithms, as it was in this case, not about having more data.
Another example of this: Peter Norvig, the director of research at Google, is quoted as saying that Google does not have better algorithms, it only has more data. I think it should be clear by now that that's not the case, right? Google has a lot of very smart people working on bleeding-edge algorithms, deep learning, you name it, and that's super important. And really this claim, which was related to a paper he co-wrote called "The Unreasonable Effectiveness of Data", only applies in some cases. This is a plot from a famous paper by Banko and Brill. In that paper they showed how, for a natural language processing task, in particular something like predicting the meaning of certain words, they used several different kinds of algorithms, going all the way from Naive Bayes to the perceptron, and when they plotted the test accuracy curves they saw that it really did not matter so much which algorithm they used; all the curves basically look the same. The only thing that matters is how many millions of words you're throwing at the algorithm. So this was interpreted by many as the conclusion that the algorithm doesn't really matter, what matters is that you throw more data at it.
Now, the thing is, this is true only in some cases. It is true for models that have low bias, where you can actually keep increasing the accuracy because it's a high-complexity model. In this case, the model has as many parameters as there are words in the dictionary; it has millions of parameters that you need to tune, and adding more millions of words is definitely going to help you increase the test accuracy. But in most cases this is not the situation, and that's why adding more data doesn't always help. So again, there are cases where adding more data, either by augmenting the features or by adding more training examples, might help, but there are others where it won't. Here is a clear example of a case where it doesn't matter. This was a production system we had at Netflix, an important one, and we did the experiment; it wasn't even a thought experiment, we actually tried it: hey, let's add more training examples. This axis shows the millions of training examples we were throwing at the system.
Just for fun, let's keep adding more and more training data and see how the test accuracy increases. Well, it turns out that after roughly 2 or maybe 3 million training examples, the accuracy basically stays the same. It even goes slightly down, because there might be some overfitting at some point, but it's basically flat after 3 million. So in this case it doesn't really matter: you can throw in as much data as you want, but the accuracy is not going to improve. Again, it's not always about having more data; sometimes it's about being better with the algorithm and the data that you have.
Okay. Very much related to this, there's the question of, well, everyone talks about big data and having more data: is it true that you always need all your data? And the answer is no. Sometimes big data might even get in the way, and I've seen that happen many times. People get over-obsessed with using all the data they have, and then obviously they can't fit it into a single machine, they need to think about a distributed approach for the algorithm, and it takes a lot of infrastructure and a lot of people to make that work. They might have been able to get to the same result by doing some smart, say, stratified sampling, reducing the data set and still using the data they have, but in a smart way. So think about it when you're facing one of these problems. Yes, you might have a lot of data and, as we did at Netflix in the plot I showed before, for fun you can use all of it, and most of the time it's not going to hurt. But the real question is not whether it's going to hurt; it's whether it's going to make things unnecessarily complex when you could be doing the same thing with a subset of your data, used in the right way.
Okay. The flip side, or the complement to this, is that yes, sometimes you do need a more complex model, and again it's not only about having more data. This is another paradox I've seen many times in many places. People have some features that they have developed for a given model. Imagine you have a linear model, say a logistic regression classifier, and you've been working on generating features for that logistic regression for some time, and then you say, hey, let me try a more complex model, let me try some gradient-boosted decision trees. And all you do is take those features you built for the logistic regression, throw them into your GBDT, and it turns out it doesn't improve. So your conclusion is: I don't need a more complex model. Well, that's false. You might not need a more complex model for the features that you have engineered. But the truth is, if you had better features, nonlinear features that allowed the more complex model to learn those nonlinearities, you might get better results. So the key takeaway here is that more complex models may require more complex features, and the other way around: more complex models may not show improvement because you're feeding them features that are too simple and can be learned just as well by a linear model. Growing the complexity of the feature space at the same time as you grow your model complexity is a very important thing to keep in mind.
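As a rough illustration of the kind of check described above, here is a minimal sketch: train on stratified subsamples of increasing size and watch where the test accuracy flattens out. The classifier, the size grid, and the dataset shape are placeholder assumptions, not the actual Netflix system.

```python
# Hypothetical sketch: measure where test accuracy stops improving as we add
# training data. GradientBoostingClassifier and the size fractions are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedShuffleSplit

def accuracy_vs_training_size(X, y, X_test, y_test, sizes=(0.1, 0.25, 0.5, 1.0)):
    results = {}
    for frac in sizes:
        if frac < 1.0:
            # Stratified subsample so the label distribution is preserved.
            splitter = StratifiedShuffleSplit(n_splits=1, train_size=frac, random_state=0)
            idx, _ = next(splitter.split(X, y))
        else:
            idx = np.arange(len(y))
        model = GradientBoostingClassifier().fit(X[idx], y[idx])
        results[frac] = model.score(X_test, y_test)
    return results  # if the curve is flat after some size, more data is not helping
```

If the accuracy at 25% of the data already matches the accuracy at 100%, that is the signal that a smart subsample, rather than a distributed pipeline, is probably enough.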
Okay, hyperparameters. That's another thing that people usually forget about, and it's very, very important. Just in case, I imagine people in the audience are familiar with what I mean by hyperparameters, but this is about things like: how do I choose the regularization lambda for, say, L2 regularization in logistic regression? How do I select the number of trees, or the depth of the trees, in a tree ensemble, or the shrinkage factor in gradient-boosted decision trees? Those are the hyperparameters. It's not about the model itself; it's about those tuning parameters that you need to find. And usually you do want to automate that. Especially when you're thinking about a real application, what's going to happen is that you have one model and data sets that keep changing all the time, because you keep retraining the model, say every day or every week or every few hours, and it turns out you need to tune those hyperparameters every time, because they're not the same. The distribution of the features in the dataset might change, things are happening all the time, so the optimal hyperparameter is not always the same and you need to retune it. So it's good to have a way to automate choosing your optimal hyperparameters. You can do that with a simple grid search: run many trainings with different hyperparameters, hold out a validation set, test on that validation set and see which hyperparameter does better.
Now, at some point you end up with a curve like this one, showing how the model accuracy evolves with, in this case, the regularization lambda. So the question is, what is the best point we can choose according to these numbers? Of course, if you just blindly look at the point with the maximum accuracy, you see that the maximum is at a lambda of zero, and you're done, right? Zero regularization. Wrong. If you choose this point, you're definitely overfitting; this is a typical plot where overfitting is occurring. You're not regularizing, and you manage to get a very high accuracy because you're overfitting to your training set. So, moving on, you say, okay, I'm going to exclude the zero lambda value, because I do want at least some form of regularization, some penalty on the complexity of the model. Then you're left with all these other points. Should I choose 0.1, the next highest accuracy? Well, it turns out that from 0.1 to 1 to 10 to 100 there's not such a big difference, so which one should you choose? The rule of thumb here is: always choose the highest regularization that is within some margin, and you can decide what margin makes sense in your case. You look at the best accuracy, which here, once you exclude the zero, is not the 0.9 but something like 0.85, and you ask which values are within, say, 10 or 20 percent of that maximum accuracy, and then you always choose the one that reduces the complexity of the model the most. So in this case my choice would be a lambda of 100. It turns out this is something we do in practice in many places; at least we started doing it at Netflix after realizing it was what worked best. I had not seen this explained by a professor until very recently: I'm actually taking the online course that Hastie and Tibshirani from Stanford have on statistical learning, and they have a similar recommendation, so I was happy to see that this is not way off from what even professors would recommend in this case.
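To make that rule of thumb concrete, here is a minimal sketch assuming scikit-learn's LogisticRegression, where C = 1/lambda; the candidate lambdas and the 10% tolerance are illustrative choices, not the values from the plot in the talk.

```python
# Sketch of the "most regularized model within a margin of the best" rule.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def pick_lambda(X, y, lambdas=(0.01, 0.1, 1.0, 10.0, 100.0), tolerance=0.10):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    scores = {}
    for lam in lambdas:                          # lambda = 0 is deliberately excluded
        model = LogisticRegression(C=1.0 / lam, max_iter=1000)
        model.fit(X_tr, y_tr)
        scores[lam] = model.score(X_val, y_val)  # validation accuracy
    best = max(scores.values())
    # Keep every lambda whose accuracy is within `tolerance` of the best,
    # then pick the largest of those, i.e. the most regularized model.
    acceptable = [lam for lam, acc in scores.items() if acc >= best * (1 - tolerance)]
    return max(acceptable), scores
```

The same pattern applies to tree depth or number of trees: among all settings whose validation score is close enough to the best, take the simplest one.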
So always choose your hyperparameters wisely. Don't just blindly take the one that gives you the highest accuracy; try to limit the complexity of your model as much as you can within some margin of the best accuracy. Another thing, related to this, is that there are better ways than grid search for searching hyperparameters, in particular the approach of Bayesian optimization using Gaussian processes. That's usually better, and not in the sense that you find a better optimum; it's better in the sense that it's faster, because you don't need to search exhaustively over all the different hyperparameter combinations. There's a bunch of packages right now that do this: Spearmint, Hyperopt, AutoML, MOE from Yelp. All of those are available online and you can use them to do hyperparameter search in an efficient way.
Okay, moving on. Supervised versus unsupervised learning: which one should you use, when, and how? First off, I know we've all learned that there's a very clear distinction between supervised and unsupervised learning: supervised you have labels, unsupervised you don't. That sounds very nice in a book or in a class, but there are nuances to it, and it's not really true that there's a clean separation between what is supervised, what is unsupervised, and what you can use each of them for. For example, unsupervised learning is something we sometimes use just as a way to do dimensionality reduction: things like PCA, ICA, matrix factorization, even clustering can be used to reduce the dimensionality of your problem. You can also use it to engineer features, and that's something I'm going to talk about: you can have a high-dimensional, sparse dataset and need to generate features out of it, and one way to do that is to use some form of unsupervised learning to generate features that are then learned by a supervised machine learning algorithm. Actually, I would say that combining unsupervised with supervised learning is one of the magic tricks that many people in companies, in practical applications, use all the time, and we don't usually talk about it much.
One example: if you want to use nearest neighbours, which is the most naive approach to a classifier, that's very costly on most standard problems. However, and this has been known for many years, you can use an initial clustering approach: just run unsupervised clustering first and then do a kNN on the result of your clustering. That's a simple example of combining unsupervised with supervised learning. Matrix factorization, which is used for collaborative filtering all the time, can also be interpreted as a combination of unsupervised and supervised. It's unsupervised because it's really doing dimensionality reduction, very similar to PCA, or even clustering, since you can use matrix factorization, particularly non-negative matrix factorization, as a form of clustering. But it's also supervised, because you have labelled targets and you are really doing a regression on those labels to predict them. So it's a combination of both approaches in a single algorithm.
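Here is a toy sketch of that "cluster first, then kNN" combination; the number of clusters and neighbours are assumed values for illustration, not a recipe from any production system.

```python
# Unsupervised k-means narrows the search space, so the expensive
# nearest-neighbour lookup only runs against points in the query's cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

class ClusteredKNN:
    def __init__(self, n_clusters=50, n_neighbors=10):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.n_neighbors = n_neighbors

    def fit(self, X):
        labels = self.kmeans.fit_predict(X)
        self.indexes, self.members = {}, {}
        for c in range(self.kmeans.n_clusters):
            pts = np.where(labels == c)[0]
            if len(pts) == 0:            # k-means can leave a cluster empty; skip it
                continue
            self.members[c] = pts
            # One small kNN index per cluster instead of one huge index.
            self.indexes[c] = NearestNeighbors(
                n_neighbors=min(self.n_neighbors, len(pts))).fit(X[pts])
        return self

    def neighbors(self, x):
        c = int(self.kmeans.predict(x.reshape(1, -1))[0])
        _, local = self.indexes[c].kneighbors(x.reshape(1, -1))
        return self.members[c][local[0]]  # map back to global indices
```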
As a matter of fact, one of the magic bullets behind deep learning is that it also, in some sense, is doing both things. I'm not going to go into detail on this, but here are a couple of references. For example, there's this plot on the lower right where Yann LeCun was showing how the different deep learning approaches, or general neural network approaches, can be categorized into supervised and unsupervised, and they're all over the place. You have some that are really somewhere in between: you can use stacked autoencoders to do pre-training for another neural net, you can train convolutional networks with unsupervised and supervised objectives. In reality, whatever you do, deep learning is really doing both things at the same time, similarly to what matrix factorization does: it's doing a form of simplification of the input feature space, which is usually very complex and high-dimensional, into something more reduced and easier to learn from, and it's also predicting labels at the same time. The combination of those two things is what makes it, in some sense, magic, and why it works so well for so many problems.
Okay, that gets me to the next lesson: it turns out that everything is an ensemble. When we think about the Netflix Prize, we usually think about algorithms like matrix factorization on one hand and restricted Boltzmann machines, which are a form of neural network, on the other, because those were the two most famous and the ones that got the best results on their own. But the reality is that the winning team, and all the other teams in the highest positions of the leaderboard, were really using ensembles. It wasn't about using one or two algorithms; it was about using hundreds and putting them into an ensemble. Initially the winning team was using gradient-boosted decision trees to combine the different methods into a nonlinear ensemble. As a matter of fact, the winning solution, which was produced just a couple of weeks before they won, introduced neural nets to do the final combination of the ensemble. So this is another interesting example: you had a bunch of different methods doing different approximations, many of them unsupervised, and then you had a supervised neural net on top of that. If you think about the layers that that algorithm had, it's very similar to deep learning, except that in this case they were manually stacking one algorithm after another instead of stacking layers in a deep neural net.
So the reality is that most practical applications of machine learning run an ensemble. And the question is, why wouldn't you? If you can use two algorithms, why use one? The only reason not to is that it's costly to maintain two algorithms, but if you get performance wins, you should actually use as many as you can and combine them. You can take completely different approaches and combine them with an ensemble. Even in the case I was talking about at the beginning, of whether you should add metadata to the user data that you have: maybe it doesn't help much on its own, but if you fine-tune two different models, one on the metadata and one on the data you have from your users, you can combine them with an ensemble and it can actually increase your accuracy and improve your results. You can also use different models at the ensemble layer, anything from complicated neural nets down to simple logistic regressions.
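A small sketch of that idea, using scikit-learn's StackingClassifier; the two base models and the logistic-regression ensemble layer are arbitrary choices for illustration, not the Netflix Prize architecture.

```python
# Two very different base models combined by a simple ensemble layer.
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

ensemble = StackingClassifier(
    estimators=[
        ("gbdt", GradientBoostingClassifier(n_estimators=200)),
        ("knn", KNeighborsClassifier(n_neighbors=25)),
    ],
    # The ensemble layer could itself be anything from a neural net down to a
    # plain logistic regression; here we keep it simple.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                          # out-of-fold predictions to avoid label leakage
    stack_method="predict_proba",
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```

Each base model's predictions effectively become features for the final estimator, which is exactly the "any model can be a feature" point made next.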
The other important thing about this is that ensembles are a great way to turn any model you have into a feature. Say you don't know what to use, you don't know whether you should use factorization machines or tensor factorization. Well, you don't really need to decide: you treat each model as a feature, feed it into the ensemble, and let the ensemble learn which one is better, and for which cases. That's the other thing: it might turn out that for some cases, for some users, for part of your data, one model is more predictive, and the other one is better for other data. The ensemble will figure those things out. I don't know if you've read or heard about the book The Master Algorithm by Pedro Domingos. It's an interesting book, and I know Pedro very well; here's a Twitter exchange we had before I presented this talk for the first time. The one thing I disagree with him on is his proposal for what the master algorithm should be. My point is that the master algorithm is really an ensemble, because an ensemble is an uber-algorithm where you can throw in everything else you have and it will learn how to combine it. He doesn't disagree completely, by the way; I've talked to him since.
Okay. The only problem with what I just mentioned is that turning any model into a feature that goes into another model is really nice, but it's a big nightmare for system design. I won't go into this very deeply, but when you're designing systems where you have models learning things that are then used by another model that is learning something else, which then feeds back, it gets complicated, and you need to be aware of it. There's a very good talk, the keynote at ICML last year by Léon Bottou from Facebook Research, which I really recommend; he talked about feedback loops in machine learning and similar issues. I think the key takeaway is that if you're in this situation where you're going to be using the outputs of one model as inputs to another, you should decouple that dependency as much as possible, and make sure that if, for whatever reason, the distribution of the output of your first model changes, the second one doesn't suddenly break, or at least is able to retrain. That's another reason you need to retrain models constantly: because the distribution coming out of the first model may have changed.
It would be great if we could treat machine learning infrastructure the way we do software. Remember, I also have a background in software engineering and I taught software engineering at the UPF for several years. In software engineering there are a lot of design patterns and ways to decouple interfaces and make sure that if you change one, the contract between the two interfaces is clear. In machine learning, that's messy and not really well defined. So it would be great if we had better design patterns for machine learning, but we're basically just getting started there. If there are people in the room thinking about interesting things to study, I think design patterns for machine learning systems would be a great one.
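Going back to the decoupling point above, one cheap safeguard can be sketched like this: compare today's distribution of an upstream model's outputs against the distribution the downstream model was trained on, and flag a retrain when they diverge. The two-sample Kolmogorov-Smirnov test and the threshold here are assumptions, not a prescription from the talk.

```python
# Hypothetical guard between two chained models: alert (or trigger a retrain of
# the downstream model) when the upstream model's output distribution drifts.
from scipy.stats import ks_2samp

def upstream_output_drifted(reference_scores, todays_scores, p_threshold=0.01):
    """reference_scores: upstream outputs captured when the downstream model
    was trained. todays_scores: the same model's outputs today."""
    _, p_value = ks_2samp(reference_scores, todays_scores)
    return p_value < p_threshold  # True => distributions differ, consider retraining
```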
I also want to recommend, in case you haven't read it, this paper by Sculley and other people at Google, "Machine Learning: The High-Interest Credit Card of Technical Debt". It's an interesting paper. It was presented at NIPS; the first time in a workshop I organized two years ago, and this last year in a workshop by Sculley himself. It's interesting more for the problems it uncovers than for the solutions, but I totally recommend it.
Okay, feature engineering. When you're designing features that are going to be used by a model, and remember a feature can actually be the output of another model, there are several properties you should try to get. You should make sure the features you design are reusable in as many different situations as possible. Transformability is another one: you usually want to use features, but also basic transformations of features. If you can take the log of a feature and put it on a log scale, that sometimes does wonders in particular cases, especially when you're dealing with power-law distributions and things like that; the same goes for nonlinear transformations like the max, or aggregating over time windows. That's super important: you need to design your features so they can be not only reused but also transformed. It's also important that they're interpretable. You need to understand what they mean and be able to interpret their values; otherwise things get really messy and complicated at some point. It might not seem like it at the beginning, when you only have five features and you know what they are, but when you have 500, or 5,000, things become really hard to understand: why did this change, why did this matter, why did this model suddenly break? Reliability is also important.
To give you an example, this is an example from Quora. At Quora we have a very interesting problem, which is: given a question and a number of answers, how can we rank those answers from best to worst? And we need to think about how to formulate this as a machine learning problem. I think this is an interesting case because it's not obvious what to do. What does a good answer even mean, how do you rank answers, and what does that mean? So, first of all, what is a good answer? We talked to the product team, and it turns out the product team knows very well what a good answer is; they have a very clear definition and they sent it to us: a good answer has to provide an explanation, has to be well formatted, has to have references, and so on, a list of different things. That's their definition of a good answer. The next thing is that we turn that definition into features: interaction features, like was it upvoted, was it downvoted, did people click on it; user features, like what's the expertise of the user in a topic; and so on. And finally, we train a machine learning model, in this case supervised, with positive and negative labels, on those features, so the model can learn the things we already know matter, the things related to the product definition of the problem itself.
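As a toy illustration of "transformable" features, here is a sketch that exposes one raw count both on a log scale and aggregated over a rolling window; the column names and the 7-day window are invented for the example, not Quora's actual features.

```python
# The same raw signal as a log-scaled feature and as a rolling aggregate.
import numpy as np
import pandas as pd

def build_features(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per (item_id, date) with a raw `upvotes` count."""
    out = events.copy()
    out["log_upvotes"] = np.log1p(out["upvotes"])   # tame power-law-distributed counts
    out = out.sort_values(["item_id", "date"])
    # Rolling sum over the last 7 daily rows, computed per item.
    out["upvotes_7d"] = (
        out.groupby("item_id")["upvotes"]
           .transform(lambda s: s.rolling(window=7, min_periods=1).sum())
    )
    return out
```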
Okay, next lesson: implicit signals beat explicit ones almost always. This is another interesting thing I've learned over the years. It might not come as a big surprise, but in case you haven't heard about it: there is a two-part blog post about Netflix recommendations beyond the five stars, where we talk about the importance of other forms of feedback beyond the explicit five-star rating that the user gives. YouTube also published a blog post at some point about how their star ratings were dominated by the extreme values and they needed to get rid of them. The point is that implicit feedback, the feedback that users give without realizing they're giving feedback, tends to be more useful. And the question is, why is that, and is it always the case? Should we just forget about explicit feedback?
One way to look at this: if you look at what a lot of people actually decided to go and watch, and then you look at the list of the highest-rated feature films in 2014, that list is kind of messy, all over the place; there are titles with very high ratings that very few people actually watched. This illustrates the problem: there is very little agreement between the two. Explicit ratings reflect what people say they like, or what they aspire to, while the actual behaviour shows what they really do, and that difference between what people actually do and what they say they do is something you will also find whenever you compare implicit with explicit feedback. So you have to be careful about which signal you look at.
A particular case of this is clickbait. If you put things in front of people with very appealing or shocking images, they will click on them. But it turns out this is a big mistake, and it has led a lot of companies astray when they optimize their machine learning models that way. The reason is that you end up optimizing for short-term utility instead of long-term utility. It's the difference between looking at a person and saying this person came to watch this thing once and never came back, versus looking at behaviour over time and seeing that people who watch this kind of thing tend to come back and watch more. If you only look at raw implicit feedback, you can end up optimizing for short-term metrics and missing the big picture. Of course, the solution is not that you should forget about either implicit or explicit feedback; you should combine them in a way that represents the long-term goal. And if you look at the UI of something like Quora, we expose a whole set of different feedback mechanisms. We know we can use all the implicit feedback, and we have a lot of it, but we also have explicit feedback, and the key question is how to combine them in a way that makes sense and correlates with the long-term optimum.
Okay, sorry. Think also of something written by a very well-known expert in the field that is very informative, but that nobody ends up liking or upvoting. There are a lot of nuances in how you define your positives and your negatives.
One particular nuance, and I won't go into the details, has to do with time: it's important to define your training data in a way that doesn't already contain the solution to the problem, and I can tell you this happens to people all the time. At Netflix we always had this thing where, when somebody came to me and said "I've improved the accuracy of this algorithm by 80%", the next question was: are you using the labels in your training set? Are you using data from the future in your training set? And this happens many times, even with well-built datasets; people forget and end up using data from the future when they're trying to predict, and obviously that inflates the results.
Okay, and I'll move on to this one. This may be a bit controversial, and that's why I think it can be interesting: most of you do not need to distribute your algorithms. This goes back to my discussion at the beginning about data and models, but there's more to it. I think in most of the use cases I've seen, and I know because I've worked at companies that have a lot of data, most of what people do in practice can fit on a single machine. Obviously you parallelize, you use the different cores of that single machine, but you don't distribute, and that makes everything much faster. And why is that? Well, first of all, you can do smart data sampling, as I said before. Second, sometimes you don't need results to be available right away, so you can use offline schemes where you process one step and then the next one, splitting your algorithm into a couple of different steps. And finally, you can write very efficient parallel code on a single machine without having to use the network and send data across it. So for most use cases except large-scale deep learning, which is the one I would say definitely needs to go distributed, most things fit on a single machine.
Actually, I always caution people about the dangers of things like Hadoop and Spark, which, by the way, we use all the time at Quora and used at Netflix, but they're dangerous because they lead people to not care enough about being efficient with their computation. Here's an example. We had a Spark implementation of an algorithm here at Quora. It took six hours and 15 machines to run, and it took that long just because nobody had cared enough about it: somebody coded it, it ran on the Spark cluster, and it was fine, until we decided to look into it and say, hey, can we optimize this? So we put a developer, a C++ developer, on it for four days, and he turned it into a single-machine computation that took 10 minutes to compute exactly the same thing. From six hours on 15 machines to 10 minutes on one machine. Of course, I'm not claiming you can do that with everything, but the fact that you have these very easy-to-use distributed environments like Hadoop and Spark also leads people to be careless about the efficiency of their computation.
Okay, I'm going to skip this one and basically finish with the last lesson: the story of data science and machine learning engineering. That's an interesting one. I don't know how much this is being talked about in Barcelona right now, but here there are a lot of companies trying to figure out how to configure their teams to be successful.
What is a data scientist? What is the difference between data science and machine learning engineering, and how do those things fit into the organization? Now, we've all heard definitions of data science, and there are a lot of very good answers on Quora about it: it's the combination of a bit of hacking, a little (or a lot) of math and statistics, and some domain expertise, and that's how you get a data scientist. And that's okay, but the real question is not what a data scientist is, which is a good question; it's more: okay, I have a company, where do I put the data science team, what does it do, and how does it integrate with the engineering team that is actually building the systems that run things? Ideally you would want strong data scientists who are also good engineers, and that would be the end of the question, but it turns out that's really hard. It's really hard to have a PhD-level grasp of statistics and math and at the same time be a hardcore developer who can implement efficient C++ code. There are a few of those unicorns, as we call them, but most of the time you'll find people who are really good at one thing or really good at the other. So how do you do it?
My proposal, and this is what we do here at Quora, is to think of the machine learning innovation funnel as a three-part thing: the first part is data research, the second is machine learning exploration and product design, and the third, very important, is A/B testing, which I haven't talked much about today. The first part, data research, is basically in charge of finding hypotheses: looking at the data and figuring out what problems we even need to solve, what things need to be addressed because they're not working or could work better. This is definitely where data science teams do their best work: doing research, diving into the data, coming up with hypotheses. This belongs to data science. Part two is where the machine learning solution comes in: okay, we have a hypothesis, we know that users in this particular segment are not clicking enough on cat videos and we need to improve that and find a solution. With the feedback from the research phase, we build a solution; that falls into the machine learning engineering part, in combination with product management, with the product team. Finally, part three, A/B testing: we have a solution, we put it out there, we look at the data from our users, we have version A and version B, we compare, and we decide which one works better. That's also owned by data science.
The nice thing about this is that it's an iterative approach. Once you're done with A/B testing, you go back to data research, and that's how it connects back to data science: data science owns parts one and three, and after part three you go back to part one because you're doing more research. So you can iterate very quickly, even within the context of a single team, and it works really well.
You need people in between, machine learning engineers who are able to understand a bit of the data research and a bit of the A/B testing, of course, because this is going to move very quickly, but that's easier to find than the unicorns we were talking about before. Okay, I'm almost over time. Some conclusions. If I have to summarize all these lessons in one slide, I would say: choose the right metric, be thoughtful about your data, understand the dependencies between data, models and systems, optimize only the things that matter, and also be thoughtful about your machine learning infrastructure, your systems and tools, and about how you organize your teams. Okay, I'm going to take some questions now. I'll also mention that there is a session on Quora where I'll be answering questions. It's open now, you can add your questions if you go to it, and I'll be answering them next Tuesday on the Quora page, in case people have more questions and want to follow up later. Thank you.
Hello. Hello. Well, thank you very much. That was great. Maybe the applause didn't come through because you were... muted? Muted, right.
I have a first question. I'll use an example. I had worked with traffic models. (Sorry, sorry. Okay, it seems there's no way to remove it. Ah, thank you very much.) I can build a model, for example, to do traffic prediction, and it turns out the prediction is very good. But when it comes down to it, I'm not interested in all situations, I'm interested in the congestion situations. And this is something that happens often: you can build a model with very good accuracy, but in the end you're only interested in certain cases. One way to solve this is to put weights on the observations and say, the observations I care most about I boost more in training, and the ones I care less about I weight less. But this way of setting the weights is very explicit; I could put, I don't know, a multiplier in front, say proportional to the traffic volume. This can also come up when your data is not balanced: you have few data points, but some particular ones are the ones you care most about, and then what you can do is weight them. It's quite a manual thing, and I don't know to what extent it's actually used; to me it makes sense, but I haven't seen it very often. So the question is really about how to incorporate real-world costs into the training of the model.
Well, I had muted the mic, because it seems that when I leave it open there's some feedback. Yes, that's a very good question. In fact, the issue of unbalanced datasets is something we run into very often, and there are many ways to work on it and find solutions. Among others, there's the question of sampling and making sure you're weighting things correctly. For example, what you're describing is very common when you're working with users as well. In that case, very often you don't want to give the same weight to all users, or if you do a uniform sample of your users, the representation you get may not be adequate.
Because really, for example, if you're an e-commerce site, you may want the people who buy more to have more weight and more value in the model you're training than what a uniform sample of users would give you; that's fairly obvious. And then, as for ways to make this work, what you mentioned, giving weights to the training examples, is done, and done quite often. In fact, there are fairly well-documented ways of treating those weights as hyperparameters of your algorithm, because you can tune them while you're optimizing a cost function on your test dataset. In a way, imagine the weights become parameters of your model, and then you say: I'll build my test set so that, in your example, it's made up mostly of moments when the traffic congestion was very high. Then that same training and validation process, with the objective you actually care about, lets you determine the weights you should be assigning. That's just one example. I think the solution is a combination of three things. One, choosing the right training set, doing the right sampling of your data to represent what you want to represent. Two, what you mentioned: giving weights to the training examples, which you can learn as part of the training process itself, as parameters of your model. And three, doing something similar in the objective function: putting some kind of weight into the objective function you're optimizing in your algorithm, which is similar to weighting the training examples, but done in the function you're trying to approximate. I don't know if that answers your question. Yes, yes, thanks.
More questions?
[Question from the audience about music recommendation, largely unintelligible in the recording.] ...There are contextual factors that are very hard to take into account for music and that are easier to take into account for films. It's not so much about personalized taste; it's about the fact that many people have a very broad taste in music, and what matters more is the context, the mood, the personal situation right now, rather than my overall taste. Just to give you an example, I think that right now Spotify is doing a pretty good job with their... I don't know what it's called in Spanish, but in English it's Discover Weekly. I use Spotify all the time and I talk to people who use it all the time, and I think Discover Weekly is pretty good when you don't know what to listen to and you're tired of listening to the same music: you click on Discover Weekly and it's pretty good. You've already predefined and predetermined that you're in the mood for discovery, and I think that's why it works. But in other cases, Spotify would need to know: is it raining outside? Am I in a bad mood? Am I sad? Am I happy?
Do I need to be uplifted? Am I going for a run and need music that's good for a run? Am I with friends? Am I in the car? All these contextual things matter so much that I think that's what makes music recommendation more difficult. By the way, those things also matter in Netflix recommendations; it's just that the variability is not as high as it is in music. In other words, in movie recommendation, in Netflix recommendation, you have a few thousand things you can recommend, and people don't have that many different patterns: they usually watch TV shows during the week and movies on the weekend, with a couple of variations. But in music, because the unit you're recommending is so small and there's so much variability, it becomes much harder. So it's not so much about the algorithmic solutions, which are pretty similar; it's more about those variations and how you integrate them into the application itself. That would be my guess. By the way, there are a lot of people in the U.S., I'm not sure about Spain, who use Pandora, which is a purely recommendation-based radio station. If you compare Pandora to a standard radio station, I think Pandora is way better. If you compare it to the recommendations on Netflix, maybe it's a different situation, because of the context and the mood and all these different variables.
He says thank you. More questions?
Hi, thanks very much for the talk. Can you hear me? Yeah. I wanted to ask: from the things you've presented, you're presenting more from the side of using machine learning as an engineering tool, making, let's say, a black box where some question enters and some answer comes out, right? But machine learning can also be used as a way to understand some phenomenon: training a model on top of some data and then, by looking at the parameters of the model, understanding things about the phenomenon itself. How do you think this is aligned with the new trends of using ensemble models or very complicated neural networks, where it's very hard to understand what's happening inside the model?
Yeah, thanks for the question, it's a great question. As a matter of fact, I had one lesson that I skipped which was about why you should care about understanding what's inside the model. I think it's a very important question no matter what your application is. Your point, which I totally understand, is that there are some cases where the only result of the model is purely understanding some phenomenon going on in your data or in your system. That's part of the data research element that many data scientists work on, and it's really important. But even if you're just building a model to treat as a black box, being able to understand what's going on inside it becomes very, very important. So I totally agree with you: black-box models where you don't really know what's going on, where it's hard to understand what each of the different elements of the model is doing, are really hard, because among other things they make innovation very difficult. You have a model, you tune it, it works, it gives you some level of accuracy.
And now the next thing is, oh, I want to make it better, or I want to make it better for this situation or for these people. Now what do you do? You don't really know, so you basically have to try many new things in the dark and see which ones work and which ones don't. Now, I don't entirely agree that ensemble models are in that situation. You can have an ensemble model and still have visibility into feature importance, just as you do with many other models. You can find out which features matter and which don't; you can try things one at a time and see which ones work. Obviously, if those features are themselves built from another model, then you have to go to those models and figure out what's going on there. So I'm not saying it's easy, but it's not impossible. A different story is when you have an ultra-deep network with many layers: it's really hard to capture and figure out what's happening at each node of that network. But that's why most applications of neural nets currently have very well-defined targets, like detect images of this kind in this dataset, or interpret words in this vocabulary. Building a model for something like predicting user behaviour, something you're going to need to iterate on and adapt and evolve over time, can be done, and I've done things using deep learning for that, but the cost of the obscurity of the system is really high, so you need to be careful with it. Good question.
More questions? Hello? Thanks for the talk, it was very good. My question is about retraining the model. Once you have deployed the model, how do you usually approach the practical aspects of having to retrain it or maintain it? Did you hear it? Ah, your microphone is not working now. Oh, okay, sorry, I had turned it off. Okay. Yes, I heard: model retraining, how is that done in practice?
So, yes, that's a very important thing, and it's a little bit tricky to implement in practice, but it does make a lot of difference. You can't really rely on training a model once, on data that is dynamic, and hoping it will just keep working forever. As a matter of fact, I've talked to people at some companies who told me that their most successful A/B tests are the ones where they basically just retrain the model and test it again, and, oh, things are much better now. That's a clear signal that if that's happening, you should be retraining more frequently and doing it in an automated way. Now, how do you do that? Basically, you have to have an infrastructure that allows you to collect your training data and feed it into your training pipeline. That's relatively easy to do; you can have a cron job saying every night at 10pm, take the data set that will be available here and train this GBDT. That part is usually not so tricky. The trickier part is: what happens if that fails? What happens if the model I trained today is worse than the one I had yesterday? What do I do about the hyperparameter tuning? Those are the tricky things. So, just to go quickly over some of them.
On hyperparameter tuning, I already mentioned in the talk the things you can do about it. Very importantly, the one thing you need to be careful about is how you decide whether the model you trained today is actually worth using. I can't give a solution that always works, but the rule of thumb would be: you train a model, you test it on a holdout data set, and the performance of the new model should be better than the one you had yesterday. If it's not better, or particularly if it's much worse, you should not use it; you skip that model and wait for the next day. Even in the best cases, with very sophisticated infrastructure, that happens sometimes. There might be something odd in the data that day: it was Super Bowl Sunday and people were not watching Netflix, so you didn't get enough data; there could be many things, outliers of different kinds. So once you automate this kind of process, you also need to put measures in place to make sure you're not shooting yourself in the foot by deploying a model that is worse than what you expected.
Okay, Aleix? Aleix, I'm going to have to leave at 11:30, just in case you want to have one or two last questions. 11:30 my time, so that's five minutes. Any last one? Let's make it quick.
Thank you for the talk. Just a quick question, and sorry for my voice. You talked about hyperparameter tuning, and that works well in batch environments where you might have cross-validation techniques. How do you apply this, or did you find situations, in streaming scenarios, where you have algorithms that need to be tuned in terms of parameterization, but you cannot perform several passes over the data, because you only get one pass and then the data is gone? How do you handle these situations in streaming scenarios with the new algorithms that are coming up nowadays? Thank you very much.
Okay, yeah, it's a good question. I don't think I have a great solution for that. Streaming algorithms, and online learning situations in general, are a little bit tricky in many senses, and that's one of them. In those cases, I think going towards the online learning scenario is the best thing you can do, and you just need to define your approach accordingly. With multi-armed bandit approaches to online learning you have the same situation, the same problem: you don't really know, you could be going through a local minimum at some point, and your hope and expectation is that eventually the model gets good enough that it was worth it. So you can't have the kind of approach I was describing before, where you make sure that every step you take is better than the previous one; as you pointed out, that works well for batch training. In these other cases, you need to make sure you implement the system and the algorithm in a way that tends towards a global optimum, even though it might occasionally fall into local minima. So as you pointed out, this is a different domain, a different world, and that kind of guard doesn't work. The only answer I have is to make sure your algorithm eventually converges to something that is optimal.
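Going back to the retraining answer a moment ago, here is a minimal sketch of that promotion guard: retrain on a schedule, evaluate on a holdout, and only deploy the new model if it does not do worse than the one currently in production. All names and the tolerance are hypothetical, not the actual infrastructure described in the talk.

```python
# Hypothetical daily retraining job with a promotion guard.
# `train_model`, `load_training_data`, and `load_holdout` are placeholder hooks;
# the point is the comparison against the currently deployed model.
def retrain_and_maybe_promote(current_model, train_model, load_training_data,
                              load_holdout, tolerance=0.01):
    X_train, y_train = load_training_data()   # e.g. collected by a nightly cron job
    X_hold, y_hold = load_holdout()
    candidate = train_model(X_train, y_train)
    new_score = candidate.score(X_hold, y_hold)
    old_score = current_model.score(X_hold, y_hold)
    if new_score >= old_score - tolerance:
        return candidate             # promote today's model
    # Something looks off (bad data day, outliers, ...): keep yesterday's model.
    return current_model
```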
And sometimes even ensuring that is not trivial; you may need to run offline simulations to make sure it eventually happens, and always measure what's going on, right? Have online metrics that tell you this is converging, this is going well, and don't panic if there's a local dip, if something went down today or went down just now, because that might actually be part of the streaming algorithm and the online learning process. Sorry I don't have a more concrete answer.
Okay, thank you very much. Let's leave it here, because it's 8:30 here. So thank you very much. And thank you. I think that's all. Okay. Thank you very much.