I thought what we might do today is finish off where we were in this Rossmann notebook, looking at time series forecasting and structured data analysis. And then we might do a little mini review of everything we've learnt, because believe it or not, this is the end: there's nothing more to know about machine learning, other than everything you're going to learn next semester and for the rest of your life. But anyway, I've got nothing else to teach. So we'll do a little review, and then we'll cover the most important part of the course, which is thinking about how to use this kind of technology appropriately and effectively, in a way that hopefully has a positive impact on society. Last time we got to the point where we were building this CompetitionMonthsOpen derived variable, but we actually truncated it to be no more than 24 months, and we talked about the reason why: we wanted to use it as a categorical variable, because categorical variables, thanks to embeddings, give the neural net more flexibility in how to use them. So that was where we left off. Let's keep working through this, because what's happening in this notebook is stuff that will probably apply to most time series datasets you work with. As we talked about, although we use df.apply here, this is something that runs a piece of Python code over every row, and that's horrifically slow. So we only do that if we can't find a vectorized pandas or NumPy function that can do it to the whole column at once. But in this case I couldn't find a way to convert a year and a week number into a date without using arbitrary Python. Also worth remembering is this idea of a lambda function.
Any time you're trying to apply a function to every row of something, or every element of a tensor, if there isn't a vectorized version already, you're going to have to call something like df.apply, which runs a function you pass over every element. So this is basically a map, in the functional programming sense. Since very often the function you want to pass is something you're just going to use once and then throw away, it's really common to use this lambda approach. The lambda is creating a function just for the purpose of telling df.apply what to use. We could also have written this a different way, which would have been to say def create_promo2since(x): return ... (with the same body as the lambda), and then pass that in here. So that and that are the same thing. One approach is to define the function and then pass it by name; the other is to define the function in place using a lambda. And if you're not comfortable creating and using lambdas, it's a good thing to practice, and playing around with df.apply is a good way to practice it. Okay, so let's talk about this durations section, which may at first seem a little specific, but it turns out not to be. What we're going to do is look at three fields: promo, state holiday and school holiday. So basically what we have is a table of: for each store, for each date, does that store have a promo going on at that date? Is there a school holiday in that store's region at that date? Is there a state holiday in that store's region at that date? This kind of thing is events, and time series with events are very, very common. If you're looking at oil and gas drilling data, you might have the flow through a pipe, plus an event representing when it set off some alarm, or an event where the drill got stuck, or whatever. Most time series at some level will tend to represent some events.
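To make the lambda-versus-named-function point concrete, here's a minimal sketch. The column names and the week_start helper are made up for illustration (the Rossmann notebook's actual conversion differs), but the two df.apply styles are exactly the ones being contrasted:

```python
from datetime import datetime, date

import pandas as pd

df = pd.DataFrame({"year": [2015, 2015], "week": [1, 40]})

# Named function: convert an ISO year/week pair to the Monday of that week.
def week_start(row):
    # %G/%V/%u are the ISO year/week/weekday strptime directives (Python 3.6+)
    return datetime.strptime(f"{row['year']}-W{row['week']:02d}-1", "%G-W%V-%u").date()

# Option 1: pass the function by name.
dates_named = df.apply(week_start, axis=1)

# Option 2: define it in place with a lambda; the two are equivalent.
dates_lambda = df.apply(
    lambda row: datetime.strptime(
        f"{row['year']}-W{row['week']:02d}-1", "%G-W%V-%u").date(),
    axis=1)

assert (dates_named == dates_lambda).all()
```

Either way, df.apply runs arbitrary Python per row, which is why it's the fallback when no vectorized alternative exists.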
The fact that an event happened at a time is interesting in itself, but very often a time series will also show something happening before and after the event. For example, in this case we're doing grocery sales prediction. If there's a holiday coming up, it's quite likely that sales will be higher before and after the holiday, and lower during the holiday, if this is a city-based store: you've got to stock up before you go away to bring things with you, and when you come back you've got to refill the fridge, for instance. So although we don't necessarily have to do this kind of feature engineering to create features specifically saying "this is before or after a holiday", the more we can give the neural net the kind of information it needs, the less it's going to have to learn it, and the more we can do with the data we already have and the sized architecture we already have. So feature engineering, even with things like neural nets, is still important, because it means we'll be able to get better results with whatever limited data and computation we have. The basic idea here, therefore, is that when we have events in our time series, we want to create two new columns for each event: how long will it be until the next time this event happens, and how long has it been since the last time it happened. In other words: how long until the next state holiday? How long since the previous state holiday? That's not something I'm aware of as existing in a library, so I wrote it here by hand. And importantly, I need to do this by store. I want to say, for this store, when was this store's last promo? How long has it been since the last time it had a promo? How long will it be until the next time it has a promo, for instance? So here's what I'm going to do.
I'm going to create a little function that takes a field name, and I'm going to pass it each of promo, then state holiday, then school holiday. So let's do school holiday, for example. We'll say fld = 'SchoolHoliday', and then get_elapsed(fld, 'After'). Let me show you what that's going to do. First of all we sort by store and date, so when we loop through this, we're going to be looping through within a store: store number one, January the first, January the second, January the third, and so forth. As we loop through each store, we're basically going to say: is this row a school holiday or not? If it is a school holiday, then we'll keep track of a variable called last_date, which says this is the last date where we saw a school holiday. And then we'll append to our result the number of days since the last school holiday. That's the basic idea here. So there are a few interesting features. One is the use of zip. I could actually write this much more simply: I could say for row in df.iterrows(), and then grab the fields we want from each row. It turns out this is about 300 times slower than the version that I have; iterating through a DataFrame and extracting specific fields out of each row has a lot of overhead. What's much faster is to iterate through a NumPy array. If you take a series like df.Store and add .values after it, that grabs a NumPy array of that series. So here are three NumPy arrays: one is the store IDs, one is whatever fld is (in this case, school holiday), and one is the date. Now what I want to do is loop through the first element of each of those lists, then the second element of each, then the third, and so on. And this is a really, really common pattern.
I need to do something like this in basically every notebook I write, and the way to do it is with zip. zip means: loop through each of these lists one at a time, and this here is where we grab the element out of the first list, the second list, and the third list. So if you haven't played around much with zip, that's a really important function to practice with. Like I say, I use it in pretty much every notebook I write, any time you have to loop through a bunch of lists at the same time. So we're going to loop through every store, every school holiday, every date. Yes? So is it looping through all the possible combinations of each of those? No, exactly not; thanks for the question. In this case, we basically grab the first store, the first school holiday, the first date together: for store one, January the first, school holiday was true or false. And if it is a school holiday, I keep track of that fact by saying the last time I saw a school holiday was that day, and then append how long it has been since the last school holiday. And if the store ID is different from the last store ID I saw, then I've got to a whole new store, in which case I have to basically reset everything. Okay, could you pass that to Karen? What will happen for the first points, where we don't have a last holiday? Yeah, so I just set this to some arbitrary starting point; it's going to end up with (I can't remember) either the largest or the smallest possible date. Okay, thanks. And you may need to replace that with a missing value afterwards, or zero, or whatever. The nice thing is, though, that it's very easy for a neural net to learn to cut off extreme values. So in this case, I didn't do anything special with it: I ended up with these negative-a-billion-day timestamps and it still worked fine. Okay, so we can go through.
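Here's a minimal sketch of that idea: a simplified version of the notebook's get_elapsed, run on a tiny made-up DataFrame (the real function is more general, but the zip-over-.values pattern and the per-store reset are the same):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store":         [1, 1, 1, 1, 2, 2],
    "Date":          pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03",
                                     "2015-01-04", "2015-01-01", "2015-01-02"]),
    "SchoolHoliday": [0, 1, 0, 0, 1, 0],
}).sort_values(["Store", "Date"])

def get_elapsed(df, fld):
    """Days since `fld` was last true, tracked per store."""
    day = np.timedelta64(1, "D")
    last_store = 0
    last_date = np.datetime64("1970-01-01")   # arbitrary starting point
    res = []
    # zip over .values arrays: far faster than df.iterrows()
    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if s != last_store:                   # reached a new store: reset
            last_store, last_date = s, np.datetime64("1970-01-01")
        if v:                                 # the event happened on this row
            last_date = d
        res.append((d - last_date) / day)     # elapsed days, as a float
    return res

df["AfterSchoolHoliday"] = get_elapsed(df, "SchoolHoliday")
# The first row of each store gets a huge value (days since 1970);
# you may want to clip or replace those afterwards.
```

Sorting by date descending and calling the same function again gives you (negative) days until the next event, which is the trick described below for the "Before" columns.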
The next thing to note is there's a whole bunch of stuff that I need to do to both the training set and the test set. In the previous section, I added a little loop where I go: for each of the training data frame and the test data frame, do these things. So in each cell I did it for each of the data frames. But now I've got a whole series of cells coming up that I want to run first for the training set and then for the test set. In this case, the way I did that was to have two different cells here: one which sets df to be the training set, one which sets it to be the test set. The way I use this is: I run just this cell, then I run all the cells underneath; then I come back and run just the other cell, and then run all the cells underneath again. So this notebook is not designed to be run from top to bottom, but rather in this particular way. I mention that because this can be a handy trick to know. You could of course put all the stuff underneath in a function that you pass the data frame to, and call it once with the test set and once with the training set, but I wanted to experiment a bit more interactively and look at each step as I go. This way is an easy way to run something on two different data frames without turning it into a function. So if I sort by store and by date, then this keeps track of the last time something happened, and it's therefore going to tell me how many days it has been since the last school holiday. Now if I sort with date descending and call the exact same function, it's going to say how long until the next school holiday. So it's a nice little trick for adding these arbitrary event timers into your time series models.
So if you're doing, for example, the Ecuadorian groceries competition right now, maybe this kind of approach would be useful for various events in that as well. Do it for state holiday, do it for promo. There we go. Okay. The next thing we look at here is rolling functions. rolling in pandas is how we create what we call window functions. Let's say I had some data, something like this, where this is date and this is, I don't know, sales or whatever. What I could do is say: okay, let's create a window around this point of, say, seven days. So this is a seven-day window, and I could take the average sales in that window. Then I could do the same thing, say, over here: take the average sales over that seven-day window. And if we do that for every point and join up those averages, we end up with a moving average. The more generic version of a moving average is a window function, i.e. something where you apply some function to some window of data around each point. Now, very often the windows I've shown here are not actually what you want. If you're trying to build a predictive model, you can't include the future as part of a moving average. So quite often you actually need a window that ends at the current point: that would be our window function. pandas lets you create arbitrary window functions using this rolling here. This here says how many time steps I want to apply the function to. And this here handles the edge: if I'm out here, where I don't have seven days to average over, should it be a missing value, or what's the minimum number of time periods to use? Here I said a minimum of one.
And then optionally you can also say whether you want to set the window at the start of a period, the end of a period, or the middle of the period. And within that you can then apply whatever function you like. So here I've got my weekly-by-store sums. So that's a nice easy way of getting moving averages, or whatever else. And I should mention: if you go to the time series page in the pandas docs (look at just the index of the time series functionality), there's lots there. Wes McKinney, who created pandas, was originally in hedge fund trading, I believe, and his work was all about time series. So pandas was originally very focused on time series, and it's still perhaps the strongest part of pandas. If you're playing around with time series computations, you definitely owe it to yourself to try to learn this entire API. There are a lot of conceptual pieces, like timestamps and date offsets and resampling, to get your head around, but it's totally worth it, because otherwise you'll be writing this stuff as loops by hand, and it'll take you a lot longer than leveraging what pandas already does. And of course pandas will do it in highly optimized, vectorized C code for you, whereas your version is going to loop in Python. So if you're doing stuff with time series, it's definitely worth learning the full pandas time series API; it's about as strong as any time series API out there. Okay, so at the end of all that, you can see those starting-point values I mentioned, slightly on the extreme side. And you can see here that on the 17th of September, store one was 13 days after the last school holiday; the 16th was 12, then 11, 10, and so forth. We're currently in a promotion right here. This is one day before a promotion.
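As a small illustration of rolling (made-up numbers; a backward-looking window like the weekly sums the notebook computes):

```python
import pandas as pd

sales = pd.DataFrame({
    "Date":  pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": [10, 12, 11, 13, 40, 9, 10, 11, 12, 13],
}).set_index("Date")

# A 7-step backward-looking window: each value sums the current row and
# the 6 before it. min_periods=1 handles the edge, where fewer than 7
# observations exist so far, instead of producing NaN.
weekly = sales["Sales"].rolling(7, min_periods=1).sum()
```

Passing center=True instead would center the window on each point, which is the version you generally can't use for forecasting, because it leaks the future into each point.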
Here we've got nine days after the last promotion, and so forth. So that's how we can add event counters to a time series, and it's probably always a good idea when you're working with time series. Now that we've done that, we've got lots of columns in our dataset, and we split them out into categorical versus continuous columns; we'll talk more about that in a moment in the review section. These are going to be all the things I'm going to create an embedding for, and these are all the things I'm going to feed directly into the model. So, for example, we've got competition distance (distance to the nearest competitor), maximum temperature, and here we've got day of week. So maximum temperature might be 22.1 (they use centigrade in Germany); distance to the nearest competitor might be 321.7 kilometers. And then we've got day of week, which might be, say, Saturday as a 6. These numbers are going to go straight into our vector, the vector we're going to feed into our neural net: 22.1, 321.7. (We'll see in a moment that we actually normalize them first, but more or less.) The categorical variable, though, we're not going to feed in directly; we need to put it through an embedding. So we'll have some embedding matrix of seven days by, I don't know, maybe dimension four. And this will look up the sixth row to get back the four items, so it's going to turn into a length-four vector, which we'll then append here. That's how our continuous and categorical variables are going to work. So then we'll turn all of our categorical variables into pandas categorical variables, in the same way we've done before, and then apply the same mappings to the test set.
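Here's a sketch of that lookup-and-concatenate step in plain PyTorch. This is just the idea, not fastai's actual implementation; the sizes are the ones from the example (day of week with an embedding of dimension 4, plus two continuous inputs):

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Day of week: 7 categories plus a spare row, embedding dimension 4.
emb = nn.Embedding(num_embeddings=8, embedding_dim=4)

day_of_week = torch.tensor([6])          # Saturday encoded as 6
cat_part = emb(day_of_week)              # looks up row 6 -> shape (1, 4)

# Continuous inputs (max temperature, competition distance), pre-scaled
# in the real pipeline.
cont_part = torch.tensor([[22.1, 321.7]])

# The first linear layer of the tabular net sees both, concatenated.
x = torch.cat([cat_part, cont_part], dim=1)
```

The embedding rows are ordinary trainable weights, so the network learns a useful 4-dimensional representation of each day of the week as part of training.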
So if Saturday is a 6 in the training set, apply_cats makes sure that Saturday is also a 6 in the test set. For the continuous variables, we make sure they're all floats, because PyTorch expects everything to be a float. Then there's another little trick I use. Both of these cells define something called joined_samp: one of them defines it as the whole training set, the other as a random subset. The idea is that I do all of my work on the sample: make sure it all works well, play around with different hyperparameters and architectures. And then when I'm happy with it, I go back and run the other line of code to say: okay, now make the sample be the whole dataset, and rerun everything. So this is a good way, similar to what I showed you before, to use the same cells in your notebook to run first on a sample and then go back later and run on the full dataset. Now that we've got joined_samp, we can pass it to proc_df as we've done before, to grab the dependent variable and deal with missing values. In this case, we pass one more thing: do_scale=True. do_scale=True will subtract the mean and divide by the standard deviation. The reason for that is that our first layer is just a matrix multiply. So here's our set of weights, and our input has something which is like 0.001, and something else which is like 10 to the 6, and our weight matrix has been initialized to random numbers between 0 and 1, say 0.6, 0.1, etc. Then this thing here is going to have gradients that are 9 orders of magnitude bigger than this thing here, which is not going to be good for optimization.
So by normalizing everything to a mean of 0 and standard deviation of 1 to start with, all of the gradients are going to be on the same kind of scale. We didn't have to do that in random forests, because in random forests we only cared about the sort order; we didn't care about the values at all. But with linear models, and things that are built out of layers of linear models like neural nets, we care very much about the scale. So do_scale=True normalizes our data for us. And since it normalizes our data, it returns one extra object: a mapper, which contains, for each continuous variable, the mean and standard deviation it was normalized with. The reason is that we're going to have to use the same mean and standard deviation on the test set, because we need our test set and our training set to be scaled in exactly the same way; otherwise their values will have different meanings. These details (making sure that your test and training sets have the same categorical codings, the same missing value replacement, and the same scaling normalization) are really important to get right, because if you don't, your test set is not going to work at all. But if you follow these steps, it'll work fine. We also take the log of the dependent variable, and that's because in this Kaggle competition the evaluation metric was root mean squared percent error. Root mean squared percent error means we're being penalized based on the ratio between our answer and the correct answer. We don't have a loss function in PyTorch called root mean squared percent error; we could write one, but it's easier to just take the log of the dependent, because a difference of logs is the same as the log of a ratio. So by taking the log, we kind of get that for free.
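To pin down both points (scaling the test set with the training set's statistics, and why the log helps), here's a tiny sketch with made-up numbers:

```python
import numpy as np

train_dist = np.array([100.0, 500.0, 1000.0])
test_dist  = np.array([200.0, 800.0])

# Fit the normalization on the training set only...
mu, sigma = train_dist.mean(), train_dist.std()

# ...then apply the SAME mean and std to the test set, so both sets
# are on the same scale. (This is what the returned mapper stores.)
train_scaled = (train_dist - mu) / sigma
test_scaled  = (test_dist - mu) / sigma

# Log of the dependent: a difference of logs is the log of a ratio,
# so RMSE on log(sales) behaves like a percent-error metric.
sales = np.array([5000.0, 5500.0])
y = np.log(sales)
ratio_via_logs = np.exp(y[1] - y[0])   # recovers sales[1] / sales[0]
```

The training columns come out with mean 0 and standard deviation 1; the test columns come out merely "close to" that, which is exactly what you want.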
You'll notice the vast majority of regression competitions on Kaggle use either root mean squared percent error or root mean squared error of the log as their evaluation metric, and that's because in real-world problems we most often care more about ratios than about raw differences. So if you're designing your own project, it's quite likely you'll want to think about using the log of your dependent variable. Then we create a validation set, and as we've learned before, most of the time if you've got a problem involving a time component, your validation set probably wants to be the most recent time period rather than a random subset. So that's what I do here. When I've finished modeling, and I've found an architecture and a set of hyperparameters and a number of epochs and all that stuff that works really well, then if I want to make my model as good as possible, I'll retrain on the whole thing, including the validation set. Currently, fastai assumes that you do have a validation set, so my hacky workaround is to set my validation set to just one index, the first row. That way all the code keeps working, but there's no real validation set. Obviously, if you do this, you need to make sure your final training uses exactly the same hyperparameters, exactly the same number of epochs, exactly the same everything as the thing that worked, because you no longer have a proper validation set to check against. I have a question regarding the get_elapsed function which we discussed before. In get_elapsed, we're trying to find when the next holiday will come, how many days away it is. Every year the holidays are more or less fixed: there will be a holiday on the 4th of July, the 25th of December, and there's hardly any change. So couldn't we just look at previous years and get a list of all the holidays that are going to occur this year? Maybe.
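A minimal sketch of that time-based split (the column names and cutoff date are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Date":  pd.to_datetime(["2015-06-01", "2015-06-15",
                             "2015-07-30", "2015-07-31"]),
    "Sales": [100, 110, 120, 130],
})

# Validation set = the most recent period, not a random subset.
cutoff = pd.Timestamp("2015-07-01")
val_idx = df.index[df.Date >= cutoff].tolist()
```

The hacky "retrain on everything" trick then amounts to setting val_idx = [0]: a single-row validation set that keeps all the code paths working while effectively training on the whole dataset.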
In this case, I guess that's not true of promo, and some holidays change, like Easter. So this way I get to write one piece of code that works for all of them, and it doesn't take very long to run. There might be ways: if your dataset was so big that this took too long, you could maybe do it on one year and then somehow copy it, but in this case there was no need. I always value my time over my computer's time, so I try to keep things as simple as I can. So now we can create our model. To create our model, we have to create a model data object, as we always do with fastai. A columnar model data object is just a model data object that represents a training set, a validation set, and an optional test set of standard columnar structured data. We just have to tell it which of the variables to treat as categorical, and then pass in our data frames. For each of our categorical variables, this is the number of categories it has; so for each of our embedding matrices, this tells us the number of rows in that embedding matrix. And then we define what embedding dimensionality we want. If you're doing natural language processing, the number of dimensions you need to capture all the nuance of what a word means and how it's used has been found empirically to be about 600. It turns out that when you build NLP models with embedding matrices smaller than 600, you don't get as good a result as you do with size 600; beyond 600, it doesn't seem to improve much. I would say that human language is one of the most complex things we model, so I wouldn't expect you to come across many, if any, categorical variables that need embedding matrices with more than 600 dimensions. At the other end, some things may have pretty simple causality. So, for example, let's have a look: state holiday.
Maybe if something's a holiday, then it's just a case of: at stores in the city there's some behavior, at stores in the country there's some other behavior, and that's about it. Maybe it's a pretty simple relationship. So ideally, when you decide what embedding size to use, you would use your knowledge about the domain to decide how complex the relationship is and therefore how big an embedding you need. In practice, you almost never know that; you'd only know it because maybe somebody else has previously done that research and figured it out, as in NLP. So in practice, you probably need some rule of thumb, and then, having tried your rule of thumb, you could maybe try a little bit higher and a little bit lower and see what helps. So it's kind of experimental. Here's my rule of thumb: look at how many discrete values the category has, i.e. the number of rows in the embedding matrix, and make the dimensionality of the embedding half of that. So for day of week, which is the second one: eight rows and four columns, i.e. the number of categories divided by two. But then I say: don't go above 50. So here you can see that for store, where there are over a thousand stores, the dimensionality is still only 50. Why 50? I don't know; it seems to have worked okay so far. You may find you need something a little different. Actually, for the Ecuadorian groceries competition I haven't really tried playing with this, but I think we may need some larger embedding sizes. It's something to fiddle with. Prince, can you pass that left? As your variables' cardinality becomes larger and larger, you're creating wider and wider embedding matrices. Aren't you therefore massively risking overfitting, because you're introducing so many parameters that the model can never possibly capture all that variation unless your data is absolutely huge? That's a great question.
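That rule of thumb is a one-liner. The cardinalities below are illustrative; note they already include a spare row for missing/unknown values, which is why day of week shows up as 8 rows, and the "divide by two" is written here as (c + 1) // 2, one common rounding of the rule:

```python
# Embedding size = roughly half the cardinality, capped at 50.
cat_sz = [("Store", 1116), ("DayOfWeek", 8), ("Year", 4)]

emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
# e.g. Store -> (1116, 50), DayOfWeek -> (8, 4), Year -> (4, 2)
```

Each tuple is (rows, columns) of one embedding matrix, which is exactly the list of tuples passed to the learner below.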
And so let me remind you of my golden rule about the difference between modern machine learning and old machine learning. In old machine learning, we control complexity by reducing the number of parameters. In modern machine learning, we control complexity by regularization. So the short answer is no: I'm not concerned about overfitting, because the way I avoid overfitting is not by reducing the number of parameters but by increasing my dropout or increasing my weight decay. Now, having said that, there's no point using more parameters for a particular embedding than you need, because regularization penalizes the model, either by adding randomness (dropout) or by actually penalizing weights (weight decay), so we'd rather not use more than we have to. But my general rule of thumb for designing an architecture is to be generous with the number of parameters. In this case, if after doing some work we felt like, you know what, the store doesn't actually seem to be that important, then I might manually go and change this to make it smaller. Or if I found there wasn't enough data here, and I was either overfitting or having to use more regularization than I'd like, then you might go back and reduce it. But I would always start by being generous with parameters. And in this case, this model turned out pretty good. So now we've got a list of tuples containing the number of rows and columns of each of our embedding matrices, and when we call get_learner to create our neural net, that's the first thing we pass in: how big each of our embeddings is. Then we tell it how many continuous variables we have, how many activations to create for each layer, and what dropout to use for each layer. And then we can go ahead and call fit. So we fit for a while, and we're getting something around the 0.1 mark.
So I tried running this on the test set, and I submitted it to Kaggle during the week, actually last week. And here it is: private score 0.107, public score 0.103. So let's have a look and see how that would have gone. Let's start with the public score, 0.103. Not there... out of 3,000 entries, we've got to go back a long way. Here it is: 0.103, 340th. Ah. That's not good. So on the public leaderboard, 340th. Let's try the private leaderboard, which is 0.107. Oh. Fifth. So hopefully you're now thinking: there are some Kaggle competitions finishing soon which I've entered, and I've spent a lot of time trying to get good results on the public leaderboard. I wonder if that was a good idea. And the answer is: no, it may well not have been. The Kaggle public leaderboard is not meant to be a replacement for your carefully developed validation set. For example, if you're doing the iceberg competition (which ones are ships, which ones are icebergs), they've actually put something like 4,000 synthetic images into the public leaderboard and none into the private leaderboard. This is one of the really good things Kaggle tests you on: are you creating a good validation set, and are you trusting it? Because if you're trusting your leaderboard feedback more than your validation feedback, you may find yourself in 350th place when you thought you were in fifth. In this case, we actually had a pretty good validation set, because it was saying somewhere around 0.1, and we actually did get somewhere around 0.1; the public leaderboard in this competition, on the other hand, was entirely useless. Yep. Can you use the box, please? In regards to that, how much does the top of the public leaderboard actually correspond to the top of the private leaderboard? Because in the churn prediction challenge, there are like four people who are just completely above everyone else. It totally depends, you know?
If they randomly sampled the public and private leaderboards, then it should be extremely indicative. But it might not be. So in this case... (okay, it's crashed... oh, here it comes) ...in this case, the person who was second on the public leaderboard did end up winning, and this entrant came seventh. In fact, you can see the little green thing here, whereas this guy jumped 96 places. If we had entered with the neural net we just looked at, we would have jumped 350 places. So it just depends. Often you can figure out how the public leaderboard was built: sometimes they'll tell you it was randomly sampled, sometimes they'll tell you it wasn't. Generally, you have to figure it out by looking at the correlation between your validation set results and the public leaderboard results, to see how well they're correlated. And if two or three people are way ahead of everybody else, they may have found some kind of leakage or something like that; that's often a sign that there's some trick. Okay. So that's Rossmann, and that brings us to the end of all of our material. Let's come back after the break and do a quick review, and then we'll talk about ethics and machine learning. Let's come back in five minutes. So: we've learnt two ways to train a model. One is by building a tree, and one is with SGD. The SGD approach is a way to train a model which is a linear model, or a stack of linear layers with nonlinearities between them, whereas tree building specifically gives us a tree. Tree building we can combine with bagging to create a random forest, or with boosting to create a GBM, or various other slight variations such as extremely randomized trees. So it's worth reminding ourselves of what these things do. Let's look at some data; actually, let's look specifically at categorical data.
So with categorical data, there are a couple of possibilities of what it might look like. Let's say we've got zip code: so we've got, say, 94003 as our zip code, with sales of 50; and then 94131, with sales of 22; and so forth. So we've got some categorical variable. There are a couple of ways we could represent that categorical variable. One would be just to use the number. And maybe it wasn't a number to start with; maybe the categorical variable is like San Francisco, New York, Mumbai and Sydney. But we can turn it into a number just by arbitrarily deciding to give them numbers, so it ends up being a number. So we could just use that kind of arbitrary number. If it turns out that zip codes that are numerically next to each other have somewhat similar behavior, then the zip code versus sales chart might look something like this, for example. Or alternatively, if two zip codes next to each other didn't have in any way similar sales behavior, you would expect to see something that looked more like this: kind of just all over the place. So there are two possibilities. So what a random forest would do, if we had just encoded zip in this way, is it's going to say: all right, I need to find my single best split point. The split point that's going to make the two sides have as small a standard deviation as possible, or, mathematically equivalently, the lowest root mean squared error. So in this case it might pick here as our first split point, because on this side there's one average, and on the other side there's the other average. And for its second split point it's going to say, okay, how do I split this? And it's probably going to say, I would split here, because now we've got this average versus this average.
And then finally it's going to say, okay, how do we split here? And it's going to say, okay, I'll split there. So now I've got that average and that average. So you can see that we're able to hone in on the set of splits it needs, even though it does it greedily, top down, one at a time. The only reason it wouldn't be able to do this is if it was just such bad luck that the two halves were always exactly balanced. But even if that happens, it's not going to be the end of the world, because it'll split on something else, some other variable, and next time around it's very unlikely that it's still going to be exactly balanced in both parts of the tree. So in practice this works just fine. In the second case, it can do exactly the same thing. It'll say, okay, which is my best first split? Even though there's no relationship between one zip code and its neighbouring zip code numerically, we can still see that if it splits here, there's the average on one side, and the average on the other side is probably about here. And then where would it split next? Probably here, because here's the average on one side, here's the average on the other side. So again, it can do the same thing. It's going to need more splits, because it's going to end up having to narrow down on each individual large zip code and each individual small zip code, but it's still going to be fine. So when we're dealing with building decision trees, for random forests or GBMs or whatever, we tend to encode our variables just as ordinals. On the other hand, if we're doing a neural network, or the simplest versions, a linear regression or a logistic regression, the best it could do is that, which is no good at all; and ditto with this one, it's going to be like that.
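This splitting behaviour is easy to see in code. Here's a small sketch of my own (not from the lecture notebook): two "special" zip code levels get high sales, the ordinal codes themselves carry no meaning, and a decision tree still hones in on them with a few greedy splits, while a plain linear model fit to the same ordinal column can't.

```python
# Hypothetical data: ordinal zip codes 0-9, where levels 2 and 7 have high
# sales (~50) and the rest have low sales (~22). The codes are arbitrary.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
zips = rng.integers(0, 10, size=500)                  # 10 arbitrary ordinal codes
sales = np.where(np.isin(zips, [2, 7]), 50.0, 22.0) + rng.normal(0, 1, 500)
X = zips.reshape(-1, 1).astype(float)

tree = DecisionTreeRegressor(max_depth=4).fit(X, sales)
lin = LinearRegression().fit(X, sales)

print(tree.score(X, sales))  # high R^2: successive splits isolate levels 2 and 7
print(lin.score(X, sales))   # near zero: one slope through arbitrary codes is useless
```

The tree only needs a handful of splits (e.g. between 1 and 2, then between 2 and 3) to carve out each high-sales level, which is exactly the "narrowing down" described above.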
So an ordinal is not going to be a useful encoding for a linear model, or for something that stacks linear and nonlinear layers together. So instead what we do is we create a one-hot encoding. So we'll say, here's 1 0 0 0, here's 0 1 0 0, here's 0 0 1 0, here's 0 0 0 1. And with that encoding it can effectively create a little histogram, where it's going to have a different coefficient for each level. And so that way it can do exactly what it needs to do. Can you pass that back, please? At what point does that become too tedious for your system? Or does it not? Pretty much never. Because remember, in real life we don't actually have to create that matrix. Instead we can just have the four coefficients and do an index lookup to grab the second one, which is mathematically equivalent to multiplying by the one-hot encoding. So that's no problem. One thing to mention: I know you guys have been taught quite a bit about analytical solutions to things, and with an analytical solution to something like a linear regression, you can't solve something with this amount of collinearity. In other words, something is Sydney if it's not Mumbai or New York or San Francisco: there's one hundred percent collinearity between the fourth of these classes and the other three, and so if you try to solve a linear regression analytically that way, the whole thing falls apart. Now note that with SGD we have no such problem. Like, why would SGD care? We're just taking one step along the derivative. Well, it cares a little, because in the end the main problem with collinearity is that there's an infinite number of equally good solutions. In other words, we could increase all of these and decrease this, or decrease all of these and increase this, and they're going to balance out.
And when there's an infinitely large number of equally good solutions, it means there are a lot of flat spots in the loss surface, and it can be harder to optimize. So there's a really easy way to get rid of all of those flat spots, which is to add a little bit of regularization. Add a little bit of weight decay, like 1e-7 even, and that basically says these are not all equally good anymore: the best one is the one where the parameters are the smallest and the most similar to each other, and so that'll move it back to being a nice loss function. Yes? Could you just clarify that point you made about why one-hot encoding wouldn't be that tedious? Sure. If we have a one-hot encoded vector and we are multiplying it by a set of coefficients, then that's exactly the same thing as simply saying: let's grab the thing where the one is. So in other words, if we had stored this one as a 0, and this one as a 1, and this one as a 2, then it's exactly the same as just saying, hey, look up that thing in the array. And so we call that version an embedding. So an embedding is a weight matrix you can multiply by a one-hot encoding; it's just a computational shortcut, but it's mathematically the same. So there are some key differences. The first key difference is between solving linear-type models analytically versus with SGD: with SGD we don't have to worry about collinearity and stuff, or at least not nearly to the same degree. And then there's the difference between solving a linear, single layer or multi-layer model with SGD versus a tree: a tree is going to complain about fewer things. In particular, you can just use ordinals as your categorical variables, and as we learnt just before, we also don't have to worry about normalizing continuous variables for a tree, but we do have to worry about it for these SGD-trained models.
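That embedding-as-lookup equivalence is easy to check directly. Here's a tiny sketch (the numbers and names are my own illustration, not from the course library):

```python
# An embedding is just a weight matrix: looking up row `idx` gives exactly
# the same answer as multiplying a one-hot vector for `idx` by the matrix.
import numpy as np

rng = np.random.default_rng(0)
n_levels, emb_dim = 4, 3             # e.g. 4 cities, a 3-dimensional embedding
W = rng.normal(size=(n_levels, emb_dim))

idx = 2                              # say Mumbai was arbitrarily coded as 2
one_hot = np.zeros(n_levels)
one_hot[idx] = 1.0

print(np.allclose(one_hot @ W, W[idx]))  # True: the lookup is just a shortcut
```

So the index lookup skips the big matrix multiply entirely, which is why high-cardinality categoricals are never "too tedious" in practice.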
So then we also learnt a lot about interpreting random forests in particular, and if you're interested, you may want to try using those same techniques to interpret neural nets. So if you want to know which of my features are important in a neural net, you could try the same thing: shuffle each column in turn and see how much it changes your accuracy, and that's going to be your feature importance for your neural net. And then if you really want to have fun, recognize that shuffling that column is just a way of calculating how sensitive the output is to that input, which in other words is the derivative of the output with respect to that input; so maybe you could just ask PyTorch to give you the derivatives with respect to the input directly, and see if that gives you the same kind of answers. You could do the same kind of thing for a partial dependence plot: try the exact same thing with your neural net, replace everything in a column with the same value, do it for 1960, 1961, 1962, and plot that. I don't know of anybody who's done these things before; not because it's rocket science, but just because, I don't know, maybe no one thought of it, or it's not in a library. But if somebody tried it, I think you should find it useful. It would make a great blog post, maybe even a paper if you wanted to take it a bit further. So there's a thought of something you could do. So most of those interpretation techniques are not particularly specific to random forests, other than things like the tree interpreter, because those are all about what's inside the tree. Can you pass it to Karim?
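The column-shuffling idea above can be sketched in a few lines for any model with a `predict` method; the function name and signature here are my own, purely illustrative:

```python
# Model-agnostic permutation importance: how much does a (higher-is-better)
# score drop when each column's values are shuffled, destroying its information?
import numpy as np

def permutation_importance(model, X, y, metric, seed=42):
    """Return, per column, the drop in `metric(y, pred)` after shuffling it."""
    base = metric(y, model.predict(X))
    rng = np.random.default_rng(seed)
    drops = []
    for col in range(X.shape[1]):
        X_shuffled = X.copy()
        rng.shuffle(X_shuffled[:, col])   # shuffle just this one column in place
        drops.append(base - metric(y, model.predict(X_shuffled)))
    return np.array(drops)
```

The same function works whether `model` is a random forest or a neural net wrapped in a thin `predict`; the gradient-based variant mentioned above would instead backpropagate the output to the inputs and look at the size of those input gradients.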
If we are applying a tree interpreter to neural nets, how are we going to make inference out of the activations along the path it follows? For example, in the tree interpreter we are looking at the paths and the contributions of the features; in this case, will it be the same with activations, I guess, the contributions of each activation along the path? I don't know, I haven't thought about it. How can we make inference out of the activations? So I'd be careful about using the word inference, because people normally use the word inference specifically to mean a test-time prediction. You mean some kind of way to interrogate the model? Yeah. Not sure; we should think about that, actually. Hinton and one of his students just published a paper on how to approximate a neural net with a tree, for this exact reason, which I haven't read yet. Could you pass that? So in linear regression and traditional statistics, one of the things that we focused on was statistical significance of changes, and things like that. So thinking about the tree interpreter, or even the waterfall chart, which I guess is just a visualization: where does that fit in? Because we can see, oh yeah, this looks important in the sense that it causes large changes, but how do we know that it's statistically significant in the traditional sense, or anything of that sort? So most of the time I don't care about traditional statistical significance, and the reason why is that nowadays the main driver of statistical significance is data volume, not practical importance. Nowadays, most of the models you build will have so much data that every tiny thing will be statistically significant, but most of them won't be practically significant. So my main focus is practical significance, which is: does the size of this influence impact your business? Statistical significance, you know, was much more important when
we had a lot less data to work with. If you do need to know statistical significance, because for example you have a medical data set that's really expensive to label or hard to collect, or it's a medical data set for a rare disease, you can always get statistical significance by bootstrapping. That is, you can randomly resample your data set a number of times, train your model on each resample, and then look at the actual variation in predictions. So with bootstrapping you can turn any model into something that gives you confidence intervals. There's a paper by Michael Jordan with a technique called the bag of little bootstraps, which takes this a little further and is well worth reading if you're interested. Can you pass it to Prince? So you said we don't need a one-hot encoding matrix if we're doing a random forest, or any tree-based model. What will happen if we do that? How bad can it be if you do one-hot encoding for a random forest? We actually did do it: remember, we had that maximum category size, and we did create one-hot encodings. The reason we did it was that then our feature importance would tell us the importance of the individual levels, and our partial dependence plot could include the individual levels. So it doesn't necessarily make the model worse; it may make it better, but it probably won't change it much at all. In this case, it hardly changed it. This is something that we have noticed on real data also: if the cardinality is higher, let's say 50 levels, and you do one-hot encoding, the random forest performs very badly. Yeah, that's right. That's why in fastai we have that maximum categorical size, because at some point your one-hot encoded variables become too sparse. So I generally cut it off at 6 or 7, also because past that point it becomes less useful: for the feature importance, there are going to be too many levels to really look at.
So can it not just ignore those levels which are not important, and just give the significant features as important? Yeah, it'll be okay, you know; it's just that once the cardinality gets too high, you're splitting your data up too much, basically, and so in practice your ordinal version is likely to be better. Okay, so there's no time to review everything, but I think those are the key concepts. And then of course remember that the embedding matrix we can use is likely to have more than just one coefficient: we'll actually have a dimensionality of a few coefficients, which isn't going to be useful for most linear models, but once you've got multi-layer models, that's now creating a representation of your category which is quite a lot richer, and you can do a lot more with it. Let's now talk about the most important bit. We started off early in this course talking about how actually a lot of machine learning is kind of misplaced: people focus on predictive accuracy. Like, Amazon has a collaborative filtering algorithm for recommending books, and they end up recommending the book which it thinks you're most likely to rate highly. And so what they end up doing is probably recommending a book that you already have, or that you already know about and would have bought anyway, which isn't very valuable. What they should instead have asked is: which book can I recommend that would cause you to change your behavior? Because that way we actually maximize the lift in sales due to recommendations. And so this idea of the difference between optimizing to influence your actions versus just improving predictive accuracy is a really important distinction, which is very rarely discussed in academia or industry. Crazily enough, it's more discussed in industry; it's particularly ignored in most of academia. So it's a really important idea, which is that in the end the goal of your model
presumably is to influence behavior. And remember, I actually mentioned a whole paper I have about this, where I introduce this thing called the drivetrain approach, where I talk about ways to think about how to incorporate machine learning into how we actually influence behavior. So that's a starting point. But then the next question is: if we're trying to influence behavior, what kind of behavior should we be influencing, and how, and what might it mean when we start influencing behavior? Because nowadays a lot of the companies that you're going to end up working at are big-ass companies, and you'll be building stuff that can influence millions of people. So what does that mean? I'm actually not going to tell you what it means, because I don't know. All I'm going to try and do is make you aware of some of the issues, and make you believe two things about them: first, that you should care, and second, that they're big, current issues. The main reason I want you to care is because I want you to want to be a good person, and to show you that not thinking about these things will make you a bad person. But if you don't find that convincing, I will tell you this: Volkswagen were found to be cheating on their emissions tests, and the person who was sent to jail for it was the programmer that implemented that piece of code. They did exactly what they were told to do. So if you're coming in here thinking, hey, I'm just a techie, I'll just do what I'm told, that's my job, I'm telling you: if you do that, you can be sent to jail for doing what you're told. So A, don't just do what you're told, because you can be a bad person, and B, you can go to jail. The second thing to realize is, in the heat of the moment, you're in a meeting with 20 people at work, and you're all talking about how you're going to implement this new feature, and everybody's discussing it, and everybody's like, we could do this, and here's
a way of modeling it, and then we can implement it, and here are these constraints. And there's some part of you that's thinking: hmm, is this right? That's not the right time to be thinking about it, because it's really hard to step up then and say, excuse me, I'm not sure this is a good idea. You actually need to think about how you would handle that situation ahead of time. So I want you to think about these issues now, and realize that by the time you're in the middle of it, you might not even realize it's happening. It'll just be a meeting like every other meeting, and a bunch of people will be talking about how to solve this technical question, and you need to be able to recognize: oh, this is actually something with ethical implications. So Rachel actually wrote all of these slides. I'm sorry she can't be here to present this, because she's studied this in depth, and she's actually been in difficult environments herself where she's seen these things happening, and we know how hard it is. But let me give you a sense of what happens. So engineers trying to solve engineering problems, and causing problems instead, is not a new thing. In Nazi Germany, IBM, then known as Hollerith... Hollerith was the original name of IBM, and it comes from the guy who invented the use of punch cards for tracking the US census, the first mass, wide-scale use of punch cards for data collection in the world. That turned into IBM, and at this point this unit, at least, was still called Hollerith. So Hollerith sold a punch card system to Nazi Germany, and each punch card would code: this person is a Jew, 8; a Gypsy, 12; general execution, 4; death by gas chamber, 6. And here's one of these cards, describing the way these various people were to be killed. A Swiss judge ruled that IBM's technical assistance facilitated the tasks of the Nazis in the commission of their crimes against humanity. This
led to the death of something like 20 million civilians. According to the Jewish Virtual Library, where I got these pictures and quotes from, "the destruction of the Jewish people became even less important because of the invigorating nature of IBM's technical achievement, only heightened by the fantastical profits to be made." So this was a long time ago, and hopefully you won't end up working at companies that facilitate genocide. But perhaps you will, because perhaps you'll go to Facebook, who are facilitating genocide right now. And I know people at Facebook who are doing this, and they had no idea they were doing this. Right now, through Facebook, the Rohingya, a Muslim population of Myanmar, are in the middle of a genocide. Babies are being grabbed out of their mothers' arms and thrown into fires; people are being killed; there are hundreds of thousands of refugees. When interviewed, the Myanmar generals doing this say: we are so grateful to Facebook for letting us know about the Rohingya fake news. Those are the words they use, the Rohingya fake news: that these people are actually not human, that they're actually animals. Now, Facebook did not set out to enable the genocide of the Rohingya people in Myanmar. No. Instead, what happened is they wanted to maximize impressions and clicks. And it turns out that the algorithms the data scientists at Facebook built kind of learned that if you take the kinds of stuff people are interested in and feed them slightly more extreme versions of that, you're actually going to get a lot more impressions. And the project managers are saying, maximize these impressions, and people are clicking, and it creates this thing. And so the potential implications are extraordinary and global, and this is something that is literally happening. This is October 2017; it's happening now. Could you pass that back there? So I just want to clarify what was happening here: so it was the facilitation of, like, fake news, or
inaccurate media? Yeah. So let me go into it in more detail. What happened was, in mid-2016, Facebook fired its human editors. It had been humans that decided how to order things on your homepage; those people got fired and replaced with machine learning algorithms. And the machine learning algorithms, written by data scientists like you, had nice clear metrics, and they were trying to maximize their predictive accuracy: okay, we think if we put this thing higher up than that thing, we'll get more clicks. And it turned out that these algorithms for ordering the Facebook newsfeed had a tendency to exploit the fact that human nature is to click on things which stimulate our existing views, and therefore on more extreme versions of things we already see. So this was great for the Facebook revenue model of maximizing engagement; it looked good on all of their KPIs. And at the time there was some negative press along the lines of, I'm not sure that the stuff Facebook is now putting in their trending section is actually that accurate, but from the point of view of the metrics people were optimizing at Facebook, it looked terrific. Then, back in October 2016, people started noticing some serious problems. For example, it is illegal to target housing to people of certain races in America. That is illegal. And yet a news organization discovered that Facebook was doing exactly that, in October 2016. Again, not because somebody on that data science team said, let's make sure black people can't live in nice neighborhoods, but because their automatic clustering and segmentation algorithm found there was a cluster of people who didn't like African Americans, and that if you targeted them with these kinds of ads, then they would be more likely to select this kind of housing or whatever. But the interesting thing is that even after being told about this three times,
Facebook still hasn't fixed it. And that is to say, these are not just technical issues; they're also economic issues. When you start changing the thing that you get paid for, that is, ads, you have to change the way you structure them: you either use more people, which costs money, or you're less aggressive with your algorithms that target people based on minority group status or whatever, and that can impact revenues. And the reason I mention this is that you will likely, at some point in your career, find yourself in a conversation where you're thinking, I'm not confident that this is morally okay, while the person you're talking to is thinking in their head, this is going to make us a lot of money; and you just don't quite ever manage to have a successful conversation, because you're talking about different things. And so when you're talking to somebody who may be more experienced and more senior than you, and they may sound like they know what they're talking about, just realize that their incentives are not necessarily going to be focused on how do I be a good person. It's not that they're thinking how do I be a bad person, but the more time you spend in industry, in my experience, the more desensitized you get to this stuff; you forget that, okay, maybe getting promotions and making money isn't the most important thing. So for example, I've got a lot of friends who are very good at computer vision, and some of them have gone on to create startups that seem like they're almost tailor-made to help authoritarian governments surveil their citizens. And when I ask my friends, have you thought about how this could be used in that way, they're generally kind of offended that I ask. But I'm asking you to think about this. Wherever you end up working, even if you end up creating a startup, tools can be used for good or for evil. So I'm not saying don't create excellent object
tracking and detection tools with computer vision, because you could go on and use those to create, say, a much better surgical intervention robot toolkit. I'm just saying: be aware of it, think about it, talk about it. So here's one I find fascinating, and there's this really cool thing that meetup.com did; this is from a meetup.com talk that's online. They actually thought about this. They thought: if we built a collaborative filtering system like we learned about in class, to help people decide what meetup to go to, it might notice that, on the whole, in San Francisco a few more men than women tend to go to techie meetups. And so it might then start to recommend techie meetups to more men than women; as a result of which, more men will go to techie meetups; as a result of which, when women go to techie meetups, they'll be like, oh, this is all men, I don't really want to go to techie meetups; as a result of which, the algorithm will get new data saying that men like techie meetups better; and so it continues. So a little bit of that initial push from the algorithm can create this runaway feedback loop, and you end up with almost all-male techie meetups, for instance. This kind of feedback loop is a subtle issue that you really want to think about when you're asking: what is the behavior that I'm changing with this algorithm I'm building? Another example, which is kind of terrifying, is in this paper, where the authors describe how a lot of police departments in the US are now using predictive policing algorithms: where can we go to find somebody who's about to commit a crime? And the algorithm simply feeds back to you, basically, the data that you've given it. So if your police department has engaged in racial profiling at all in the past, then it might suggest, slightly more often, maybe you should go to the black
neighborhoods to check for people committing crimes; as a result of which, more of your police officers go to the black neighborhoods; as a result of which, they arrest more black people; as a result of which, the data says that the black neighborhoods are less safe; as a result of which, the algorithm says to the police, maybe you should go to the black neighborhoods more often; and so forth. And this is not some vague possibility of something that might happen in the future. This is documented work from top academics who have carefully studied the data and the theory. This is serious scholarly work saying: no, this is happening right now. And again, I'm sure the people that started creating this predictive policing algorithm didn't think, how do we arrest more black people? Hopefully they were actually thinking, gosh, I'd like my children to be safer on the streets; how do I create a safer society? But they didn't think about this nasty runaway feedback loop. So this one about social network algorithms: there's actually an article in the New York Times recently about one of my friends, Renée DiResta, and she did something kind of amazing. She set up a second Facebook account, like a fake Facebook account. She was very interested in the anti-vax movement at the time, so she started following a couple of anti-vaxxers and visited a couple of anti-vax links. And suddenly her news feed starts getting full of anti-vax news, along with other stuff like chemtrails and deep state conspiracy theories. So she's like, huh, and starts clicking on those. And the more she clicked, the more hardcore, far-out conspiracy stuff Facebook recommended. So now, when Renée goes to that Facebook account, the whole thing is just full of angry, crazy, far-out conspiracy stuff. That's all she sees.
And so if that was your world, then as far as you're concerned, it's just this continuous reminder and proof of all this stuff. And so, to answer your question, this is the kind of runaway feedback loop that ends up telling the Myanmar generals, throughout their Facebook homepages, that the Rohingya are animals and fake news and whatever else. So a lot of this also comes from bias. So let's talk about bias specifically. Bias in image software comes from bias in data. Most of the folks I know at Google Brain building computer vision algorithms, very few of them are people of color. And so when they're training the algorithms with photos of their families and friends, they are training them with very few people of color. So when FaceApp then decided, we're going to try looking at lots of Instagram photos to see which ones are, you know, upvoted the most, without them necessarily realizing it, the answer was light-colored faces. So then they built a generative model to make you more "hot". And so this is the actual photo, and here is the "hotter" version: the hotter version is more white, less nostrils, more European looking. So this did not go down well, to say the least. And again, I don't think anybody at FaceApp said, let's create something that makes people look more white. They just trained it on a bunch of images of the people that they had around them. And this has serious commercial implications as well: they had to pull this feature, and they had a huge amount of negative pushback, as they should. Here's another example. Google Photos created this photo classifier: airplanes, skyscrapers, cars, graduation, and, oh, gorillas. So think about how this looks to, like, most people.
Most people look at this, and they don't know about machine learning. They say: what the fuck? Somebody at Google wrote some code to take black people and call them gorillas. That's what it looks like. Now, we know that's not what happened. We know what happened is that the team of computer vision experts at Google, which had no or few people of color working on it, built a classifier using all the photos they had available to them. And so when the system came across a person with dark skin, it was like, oh, I've mainly only seen that before amongst gorillas, so I'll put it in that category. So again, the bias in the data creates a bias in the software. And again, the commercial implications were very significant: Google really got a lot of bad PR from this, as they should. This was a photo that somebody put in their Twitter feed, saying, look what Google Photos just decided to do. You can imagine what happened with the first international beauty contest judged by artificial intelligence: basically, it turns out all the beautiful people are white, again. So you see this bias in image software, thanks to bias in the data, thanks to lack of diversity in the teams building it; and you see the same thing in natural language processing. So here is Turkish. "O" is the pronoun in Turkish, which has no gender: there is no "he" versus "she". But of course, in English, we don't really have a widely used ungendered singular pronoun, so Google Translate converts it to this. Well, there were plenty of people who saw this online and said, literally, so what? It is correctly feeding back the usual usage in English. Like, I know how this is trained: these are word2vec vectors, trained on the Google News corpus, the Google Books corpus.
It's just telling us how things are. And from a certain point of view, that's entirely true: the biased data that created this biased algorithm is the actual data of how people have written books and newspaper articles for decades. But does that mean this is the product you want to create? Does it mean this is the product you have to create? Just because the particular way you've trained the model means it ends up doing this, is this actually the design you want? And can you think of potential negative implications and feedback loops this could create? If any of these things bother you, then lucky you: you have a novel engineering problem to work on, like, how do I create unbiased NLP solutions? And now there are some startups starting to do that and starting to make some money. So these are opportunities for you. Here's some stuff where people are creating screwed-up societal outcomes because of their shitty models; okay, well, you can go and build something better. Another example of the bias in word2vec word vectors: restaurant reviews rank Mexican restaurants worse, because Mexican words tend to be associated with criminal words in the US press and books more often. Again, this is a real problem that's happening right now. Rachel actually did some interesting analysis of just the plain word2vec word vectors, where she pulled them out and looked at these analogies, based on some research that had been done elsewhere. And you can see the word2vec vector directions show that father is to doctor as mother is to nurse, man is to computer programmer as woman is to homemaker, and so forth. So it's really easy to see what's in these word vectors.
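To make that analogy arithmetic concrete, here's a minimal sketch using made-up 4-dimensional vectors. These numbers are purely illustrative; real word2vec embeddings are typically 300-dimensional and trained on billions of words, but the "king - man + woman"-style query works the same way:

```python
import numpy as np

# Toy word vectors (hypothetical values, just to illustrate the arithmetic;
# real word2vec embeddings would come from a trained model).
vecs = {
    "man":        np.array([1.0, 0.0, 0.2, 0.1]),
    "woman":      np.array([0.0, 1.0, 0.2, 0.1]),
    "programmer": np.array([1.0, 0.1, 0.9, 0.0]),
    "homemaker":  np.array([0.1, 1.0, 0.9, 0.0]),
    "doctor":     np.array([1.0, 0.1, 0.1, 0.9]),
    "nurse":      np.array([0.1, 1.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy query: programmer - man + woman ≈ ?
query = vecs["programmer"] - vecs["man"] + vecs["woman"]

# Nearest neighbor amongst the remaining words.
best = max((w for w in vecs if w not in ("programmer", "man", "woman")),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # "homemaker" with these toy vectors
```

With vectors trained on real news and book corpora, the same query surfaces exactly the gendered associations Rachel found, which is why the bias is so easy to demonstrate.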
And they're fundamental to much of the NLP software, probably just about all of the NLP software, we use today. Here's a great example, and ProPublica has actually done a lot of good work in this area. Many judges now have access to sentencing guidelines software, which says to the judge: for this individual, we would recommend this kind of sentence. Now, of course, a judge doesn't understand machine learning, so they have two choices: do what it says, or ignore it entirely. And some people fall into each category. For the ones that fall into the do-what-it-says category, here's what happens. Amongst those labeled higher risk, the subset that actually turned out not to re-offend was about a quarter of whites and about a half of African Americans. So nearly twice as often, people who didn't re-offend were marked as higher risk if they were African American. And vice versa: amongst those labeled lower risk who actually did re-offend, it turned out to be about half of the whites and only 28% of the African Americans. Now, I would like to think nobody set out to create something that does this. But when you start with biased data, and the data says that whites and blacks smoke marijuana at about the same rate but blacks are jailed for it, I think it's something like five times more often than whites, then the nature of the justice system in America, at the moment at least, is that it's not equal, it's not fair. And therefore the data fed into the machine learning model is basically going to support that status quo. And then, because of the negative feedback loop, it's just going to get worse and worse.
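The kind of disparity ProPublica measured can be checked with a few lines of code: for each group, take the people who did not re-offend and ask what share of them were labeled higher risk anyway (the false positive rate). Here's a minimal sketch with made-up numbers, not the real COMPAS data:

```python
import numpy as np

# Tiny hypothetical example: 1 = labeled higher risk / actually re-offended.
predicted = np.array([1, 1, 0, 0, 1, 1, 1, 0])
actual    = np.array([0, 1, 0, 1, 0, 0, 1, 0])
group     = np.array(["white", "white", "white", "white",
                      "black", "black", "black", "black"])

def false_positive_rate(pred, act):
    """Of the people who did NOT re-offend (act == 0), what share
    were labeled higher risk anyway (pred == 1)?"""
    didnt_reoffend = act == 0
    return (pred[didnt_reoffend] == 1).mean()

for g in ("white", "black"):
    mask = group == g
    print(g, false_positive_rate(predicted[mask], actual[mask]))
# With these made-up numbers: white 0.5, black 0.666...
```

Checking this one metric per group, rather than just the overall accuracy or AUC, is exactly the audit that surfaces the quarter-versus-half gap described above.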
I'll tell you something else interesting about this one, which the researcher Abe Gong has pointed out: here are some of the questions that are being asked. Let's take one: was your father ever arrested? So your answer to that question is going to help decide whether you're locked up and for how long. Now, as a machine learning researcher, do you think that might improve the predictive accuracy of your algorithm and get you a better R-squared? It could well. I don't know; maybe it does. We try it out and say, oh, I've got a better R-squared. So does that mean you should use it? Well, there's another question: do you think it's reasonable to lock somebody up for longer because of who their dad was? And yet these are actual examples of questions that we are asking offenders right now and then putting into a machine learning system to decide what happens to them. So again, whoever designed this was presumably laser-focused on technical excellence, getting the maximum area under the ROC curve: I found these great predictors that give me another .02. And I guess they didn't stop to think, is that a reasonable way to decide who goes to jail for longer? Putting this together, you can see how this can get more and more scary. Take a company like Taser. Tasers are these devices that give you a big electric shock, basically. And Taser has managed to do a great job of creating strong relationships with some academic researchers who seem to say whatever they're told to say. To the extent where now, if you look at the data, it turns out there's a pretty high probability that if you get tased, you will die. That happens not unusually. And yet the researchers they've paid to look into this have consistently come back and said, oh no, it was nothing to do with the Taser.
The fact that they died immediately afterwards was totally unrelated; it's just random, things happen. This company now owns 80% of the market for body cameras, and they've started buying computer vision AI companies. They're going to try to use these police body camera videos to anticipate criminal activity. So what does that mean? Is it, okay, I now have some augmented reality display saying, taze this person because they're about to do something bad? It's a worrying direction. And I'm sure nobody who's a data scientist at Taser, or at the companies they bought, is thinking, this is the world I want to help create. But they could find themselves, or you could find yourself, in the middle of this kind of discussion, where it's not explicitly about that topic, but there's part of you that says, I wonder if this is how this could be used. And I don't know exactly what the right thing to do in that situation is, because you can ask, and of course people are going to say, no, no, no, no. So what could you do? You could ask for some kind of written promise. You could decide to leave. You could start doing some research into the legality of things, to at least protect your own legal situation. I don't know; have a think about how you would respond to that. So these are some questions that Rachel created as things to think about if you're building a data product or using a model. If you're building a machine learning model, it's for a reason: you're trying to do something. So what bias may be in that data? Because whatever bias is in that data ends up being a bias in your predictions.
That potentially then biases the actions you're influencing, which potentially then biases the data that comes back, and you may create a feedback loop. If the team that built it isn't diverse, what might you be missing? So for example, one senior executive at Twitter raised the alarm about major Russian bot problems at Twitter well before the election. That was the one black person in the exec team at Twitter, the one. And shortly afterwards, they lost their job. So definitely, having a more diverse team means having a more diverse set of opinions and beliefs and ideas and things to look for; non-diverse teams seem to make more of these bad mistakes. Can we audit the code? Is it open source? Check for different error rates amongst different groups. Is there a simple rule we could use instead that's extremely interpretable and easy to communicate? And if something goes wrong, do we have a good way to deal with it? So when we've talked to people about this, a lot of people have come to Rachel and said, I'm concerned about something my organization is doing, what do I do? Or, I'm concerned about my toxic workplace, what do I do? And very often Rachel will say, well, have you considered leaving? And they'll say, oh, I don't want to lose my job. But actually, if you can code, you're in like 0.3% of the population. If you can code and do machine learning, you're in probably 0.01% of the population. You are massively, massively in demand. Now obviously, an organization doesn't want you to feel like you're somebody who could just leave and get another job; that's not in their interest. But it is absolutely true. And so one of the things I hope you leave this course with is enough self-confidence to recognize that you have the skills to get a job.
And particularly, once you've got your first job, your second job is an order of magnitude easier. This is important not just so that you feel you have the ability to act ethically; it's also important because if you find yourself in a toxic environment, which is pretty damn common unfortunately (there are a lot of shitty tech cultures and environments, particularly in the Bay Area), the best thing to do is to get the hell out. And if you don't have the self-confidence to think you can get another job, you can get trapped. So it's really important to know that you're leaving this program with very in-demand skills, and particularly after that first job, you're somebody with in-demand skills and a track record of being employed in that area. OK, great. Yes? "This is just a broad question, but what are some things that you know of that people are doing to treat bias in data?" It's a bit of a controversial subject at the moment. Some people are trying to use an algorithmic approach, where they're basically trying to say, how can we identify the bias and subtract it out? But the most effective ways I know of are the ones that try to treat it at the data level. So: start with a more diverse team, particularly a team which includes people from the humanities, like sociologists, psychologists, economists, people that understand feedback loops and implications for human behavior. They tend to be equipped with good tools for identifying and tracking these kinds of problems. And then try to incorporate the solutions into the process itself. That said, there isn't some standard process I can point you to and say, here's how to solve it. If there is such a thing, we haven't found it yet.
It requires a diverse team of smart people to be aware of the problems and work hard at them; that's the short answer. Can you pass that back, please? "This is just a general thing for the whole class: if you're interested in this stuff, there's a pretty cool book, Jeremy, you've probably heard of it, Weapons of Math Destruction by Cathy O'Neil. It covers a lot of the same topics in more depth." Yeah, thanks for the recommendation. Cathy's great; she's also got a TED talk. I didn't manage to finish the book because it's so damn depressing, I was just like, no more. But yeah, it's very good. All right, well, that's it. Thank you, everybody. This has been really intense for me. Obviously, this was meant to be something I was sharing with Rachel, so I've ended up doing one of the hardest things in my life, which is to teach two people's worth of a course on my own, while also looking after a sick wife, having a toddler, doing a deep learning course, and doing all of this with a new library that I just wrote. So I'm looking forward to getting some sleep, but it's been totally worth it, because you've been amazing. I'm thrilled with how you've reacted to the opportunities I've given you, and also to the feedback that I've given you. So congratulations.