Okay, so welcome to lesson 14, the final lesson for now. We'll talk at the end about what's next, but as you can see from what's increasingly been happening, what's next is very much about you with us, rather than us leading you or telling you. We're a community now and we can figure this stuff out together, and obviously USF is a wonderful ally to have. So for now, this is the last of these lessons. One of the things that was great to see this week was this terrific article in Forbes that talked about deep learning education. It was written by one of our terrific students, Maria, and focuses on the great work of some of the students that have come through this course. So I wanted to say thank you very much and congratulations on this great article; I hope everybody checks it out. It's really beautifully written, with terrific stories, and I found it quite inspiring. So today we're going to be talking about a couple of things, but we're going to start with time series and structured data. For time series, I wanted to start very briefly by talking about something which I think you basically already know how to do. This is a fantastic paper, and part of what makes it fantastic is that it is not by DeepMind; nobody's heard of it. It actually comes from the Children's Hospital of Los Angeles, which sits in what is, believe it or not, perhaps the epicenter of practical applied AI in medicine: Southern California, and specifically Southern California pediatrics, the Children's Hospital of Orange County (CHOC) and the Children's Hospital of Los Angeles (CHLA). CHLA, which this paper comes from, has this thing they call the VPICU, the virtual pediatric intensive care unit, where for many, many years they've been tracking every electronic signal about how every patient, every kid in the hospital, was treated, and what all their ongoing sensor readings are. One of the extraordinary things they do is that when the doctors do rounds, data scientists come with them; I don't know anywhere else in the world where this happens. So a couple of months ago they released a draft of this amazing paper where they talked about how they pulled all this data out of the EMR and from the sensors and attempted to predict patient mortality. The reason this is interesting is that when a kid goes into the ICU, if the models start saying this kid looks like they might die, then that's the thing that sets the alarms going and everybody rushes over and starts looking after them. And they found that they built a model that was more accurate than any existing model, and those existing models were built on many years of deep clinical input, and they used an RNN.
Now this kind of time series data is what I'm going to refer to as signal time series data. Let's say you've got a series of blood pressure readings: they come in and their blood pressure is kind of low and all over the place, and then suddenly it shoots up. In addition to that, maybe there are other readings, such as the points at which they received some kind of medical intervention: there was one here and one here, and then there were six here, and so forth. For these kinds of things, generally speaking, the state of health at time t is probably best predicted by all of the various sensor readings at t minus one and t minus two and t minus three. In statistical terms we would refer to that as autocorrelation, which means correlation with previous time periods. For this kind of signal I think it's very likely that an RNN is the way to go. Obviously you could probably get a better result using a bidirectional RNN, but that's not going to be any help in the ICU, because you don't have the future time periods' sensors, so be careful of this data leakage issue. And indeed, this is what the team at the VPICU at Children's Hospital of Los Angeles did: they used an RNN to get this result. I'm not really going to teach you more about this, because basically you already know how to do it; you can check out the paper and you'll see there's almost nothing special. The only thing which was quite clever was that their sensor readings were not necessarily equally spaced. For example, medical interventions are clearly very widely spaced and not equally spaced. So rather than having the RNN receive just a sequence of interventions, they actually feed it two things: one is the signal, and the other is the time since the last signal was read. So each point the RNN sees is basically some function f receiving two things: the signal at time t, and the value of t itself, or rather the difference in time, how long it was since the last reading. But that doesn't require any different deep learning; that's just concatenating one extra thing onto your vector. They actually show mathematically that this makes a certain amount of sense as a way to deal with this, and then they find empirically that it does actually seem to work pretty well. I can't tell you whether this is state of the art for anything, because I just haven't seen deep comparative papers or competitions that really have this kind of data, which is weird, because a lot of the world runs on this kind of data, and doing things with it effectively is super valuable. If you're an oil and gas company, what's the drillhead telling you, what are the signals coming out of the pipe telling you, and so on and so forth. But there we go; it's not the kind of cool thing that the Google kids work on. So I'm not going to talk more about that; that's how you can do time series with this kind of signal data.
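To make that concrete, here's a minimal sketch of the idea. It is not the paper's actual code, and all the shapes, names and values here are made up; the point is only that the time since the previous reading is just one extra element concatenated onto the input vector at each step.

```python
# Hedged sketch: an RNN whose per-step input is [sensor readings, delta_t],
# where delta_t is the time elapsed since the previous reading.
import numpy as np
from keras.models import Sequential
from keras.layers import GRU, Dense

n_timesteps, n_signals = 48, 10                      # hypothetical: 48 readings, 10 sensors
x = np.random.rand(256, n_timesteps, n_signals + 1)  # last column of each step = delta_t
y = np.random.randint(0, 2, (256, 1))                # made-up binary outcome

model = Sequential([
    GRU(64, input_shape=(n_timesteps, n_signals + 1)),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, y, epochs=1, verbose=0)
```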
You can also incorporate all of the other stuff we're about to talk about, which is the other kind of time series data. For example, there was a Kaggle competition looking at forecasting sales for each store at this big company in Europe called Rossmann, based on the date, what promotions are going on, what competitors are doing, and so forth. This kind of data is likely to look something like this: some kind of seasonal pattern, or maybe it'll have some kind of trend to it. These kinds of seasonal time series are very widely analysed by econometricians, and they're everywhere, particularly in business. If you're trying to predict how many widgets you have to buy next month, or whether to increase or decrease your prices, all kinds of operational questions tend to look like this: how full your planes are going to be, whether you should add promotions, and so on. It turns out that the state of the art for this kind of problem is not necessarily to use an RNN, and I'm actually going to look at the third place result from this competition, because the third place result was nearly as good as places one and two but way, way simpler, and it also turns out to be something we can build on for almost every model of this kind. And, surprise surprise, it turns out that the answer is to use a neural network. Now I need to warn you again: what I'm going to teach you here is very, very uncool. You'll never read about it from DeepMind or OpenAI; it doesn't involve any robot arms; it doesn't involve thousands of GPUs. It's the kind of boring stuff that normal companies use to make more money, or spend less money, or satisfy their customers, so I apologize deeply for that oversight. Having said that, in the 25 years or more I've been doing machine learning work applied in the real world, 98% of it has been this kind of data. When I was working in agriculture (I've worked in wool, macadamia nuts and rice) we were figuring out how full our barrels were going to be and whether we needed more, and how to set futures market prices for agricultural goods. I've worked in mining and brewing, which required analyzing all kinds of engineering data and sales data. I've worked in banking, which required looking at transaction account pricing and risk and fraud. All of these areas basically involve this kind of data. So although nobody publishes stuff about this, because anybody who comes out of a Stanford PhD and goes to Google doesn't know about any of those things, I'd guess it's probably the most useful thing for the vast majority of people. And excitingly, it turns out that you don't need to learn any new techniques at all. In fact, the model that got this third place result is a very, very simple one: each different categorical variable was one-hot encoded and chucked into an embedding layer, the embedding layers were concatenated and chucked through a dense layer, then a second dense layer, and then through a sigmoid function into an output layer. Very simple. The continuous variables they haven't drawn here (all these pictures come straight from the paper which the folks that came third kindly wrote about their approach); the continuous variables basically get fed directly into the dense layer. So that's the structure of the model. How well does it work? The short answer is that, compared to k-nearest neighbors, random forests and GBMs, a simple neural network beats all of those approaches just with standard one-hot encoding.
But then EE is entity embeddings, so adding in this idea of using embeddings. Interestingly, you can take the embeddings trained by a neural network and feed them into a KNN or a random forest or a GBM, and in fact using the embeddings with every one of those things is way better than any of them without embeddings; it's better than anything other than a neural network. So that's pretty interesting. And then if you use the embeddings with a neural network, you get the best result still. This is kind of fascinating, because training this neural network took me some hours on a Titan X, whereas training the GBM took, I think, less than a second; it was so fast I thought I had screwed something up, and then I tried running it and it was like, holy shit, it's giving accurate predictions. GBMs and random forests are so fast. So in your organization, you could try taking everything that you could think of as a categorical variable, and once a month train a neural net with embeddings, then store those embeddings in a database table and tell all of your business users: hey, any time you want to create a model that incorporates day of week, or store ID, or customer ID, you can go grab the embeddings. They're basically like word vectors, but they're customer vectors and store vectors and product vectors. I've never seen anybody write about this other than this paper, and even in this paper they don't really get to this hugely important idea of what you could do with these embeddings. Someone asked: what's the difference between A and B and C, is it like different data types flowing into A and B and C on the previous slide? Yeah, we're going to get to that in a moment, but basically the different things are like: A might be the store ID, B might be the product ID, and C might be the day of week. One of the really nice things they did in this paper was to then draw some projections of some of these embeddings. They just used t-SNE; it doesn't really matter what the projection method is. They took each state of Germany (this company is based in Germany) and did a projection of the embeddings from the state field. Here are those projections, and I've drawn different colored circles around them, and you might notice that the different colored circles exactly correspond to the different colored circles I've drawn on a map of Germany. Now these were just random embeddings trained with SGD, trying to predict sales in stores at Rossmann, and yet somehow they've drawn a map of Germany. Obviously the reason why is that things close to each other in Germany have similar behaviors around how they respond to events, and who buys what kinds of products, and so on and so forth. So that's crazy fascinating. Here's the bigger picture: every one of these dots is a pair of stores, and this shows the relationship between the distance in embedding space and the actual distance between the stores. You can basically see that there's a strong correlation between things being close to each other in real life and close to each other in these SGD-trained embeddings. Here are a couple more pictures; the lines drawn on top are mine, but everything else is straight from the paper. On the left is the day of week embedding, and you can see the days of the week that are near each other have ended up embedded close together; on the right is the month of year embedding. Again, same thing, and you can see that the weekend is clearly separate.
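If you want to make that kind of picture from your own model, here's a rough sketch. It assumes you have a trained Keras model whose state embedding layer you named 'state_embedding', and a list state_names of labels in the same order as the label codes; both of those names are mine, not the paper's.

```python
# Hedged sketch: pull a trained embedding matrix out of the model and project it
# to 2D with t-SNE, the way the paper visualised the German states.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

state_emb = model.get_layer('state_embedding').get_weights()[0]  # (n_states, emb_dim)
proj = TSNE(n_components=2, perplexity=5).fit_transform(state_emb)

plt.scatter(proj[:, 0], proj[:, 1])
for i, name in enumerate(state_names):   # assumed: labels in the same order as the codes
    plt.annotate(name, proj[i])
plt.show()
```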
So that's where we're going to get to, and I'm actually going to take you through the end-to-end process. I rebuilt the end-to-end process from scratch and tried to do it in as few lines of code as possible, because we just haven't really looked at any of these structured data type problems before; it's a very different process and even a somewhat different set of techniques. So we import the usual stuff. When you try to do this yourself you'll find three or four libraries we haven't used before, so when you hit something that says module not found, just pip install it; they're all pure Python, and we'll talk about them as we get to them. The data that comes from Kaggle comes down as a bunch of CSV files, and I wrote a quick thing to combine some of those CSVs together. This was one of those competitions where people were allowed to use additional external data as long as it was shared on the forum, so the data I'll share with you combines it all in one place; I've commented these lines out because the files I'll give you will have already been through this concatenation process. The basic tables you're going to get access to are: the training set itself; a list of stores; a list of which state each store is in; a list of the abbreviation and name of each state in Germany; a list of data from Google Trends (if you've used Google Trends you can basically see how particular keywords change over time; I don't actually know which keywords they used, but somebody found some Google Trends keywords that correlated well, so we've got access to those); some information about the weather; and then a test set. I'm not sure we've really used pandas much, if at all, yet, so let's talk a bit about pandas. Pandas lets you take this kind of structured data and manipulate it in similar ways to how you would manipulate it in a database. Just as numpy tends to be imported as np, pandas tends to become pd. pd.read_csv is going to return a DataFrame; a DataFrame is like a database table, and if you've used R, it's called the same thing there. So read_csv returns a DataFrame containing the information from that CSV file, and we're going to go through each one of those table names and read the CSV, so this list comprehension is going to return a list of DataFrames. I can now go ahead and display the head, the first five rows, of each table, and that's a good way to get a sense of what these tables are.
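Roughly what that looks like; the path and file names here are placeholders for wherever you've put the Kaggle data.

```python
import pandas as pd

path = 'data/rossmann/'                    # hypothetical location of the files
table_names = ['train', 'store', 'store_states', 'state_names',
               'googletrend', 'weather', 'test']

# one DataFrame per CSV, in the same order as table_names
tables = [pd.read_csv('{}{}.csv'.format(path, name), low_memory=False)
          for name in table_names]

for t in tables:
    print(t.head())                        # in a notebook, display(t.head()) is nicer
```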
Here's the first one, the training set: for some store on some date, they had some level of sales to some number of customers; they were either open or closed; they either had a promotion on or they didn't; it either was a state holiday or it wasn't, and likewise a school holiday or not; and then there's some additional information about the date. That's the basic information we have, and everything else gets joined onto that. For example, for each store we can join up some kind of categorical variable about what kind of store it is; I have no idea what this is, it might be a different brand or something. And what kinds of products do they carry; again, it's just a letter, I don't know what it means, but maybe some are electronics, maybe some are supermarkets, maybe some are full spectrum. Then: how far away is the nearest competitor, and in what year and month did that competitor open for business. Notice that sometimes the competitor opened for business quite late in the game, later than some of the data we're actually looking at, so that's going to be a little bit confusing. And then there's this thing called Promo2 which, as far as I understand it, indicates whether this is a store that has some kind of standard promotion timing going on; you can see here that this store has standard promotions in January, April, July and October. So that's the stores. We also know for each store what state they're in, based on the abbreviation, and then we can look up for each abbreviation the name of that state. Then, and this is slightly weird, for this state abbreviation (the last two letters in this field), during this week, here's the Google Trends data for some keyword; I'm not sure what keyword it was. And for this state name, on this date, here's the temperature, dew point and so forth. And finally, here's the test set: it's identical to the training set, but we don't have the number of customers and we don't have the number of sales. So this is a pretty standard kind of industry data set: we've got a central table, various tables related to that, and some things representing time periods or time points. One of the nice things you can do in pandas is to use the pandas_summary module and call DataFrameSummary(table).summary(), and that will return a whole bunch of information about every field. I'm not going to go through all of it in detail, but you can see, for example, for sales: on average 5,800 sales, standard deviation of 3,800, sometimes the sales go all the way down to zero, sometimes all the way up to 41,000, and there are no missing values for sales; that's good to know. This is the kind of thing that's good to scroll through and identify things like: CompetitionOpenSinceMonth is missing about a third of the time, that's good to know; there are 12 unique states here, which might be worth checking, because there are actually 16 entries in our state table for some reason; the Google Trends data is never missing, good; the year goes from 2012 through 2015; the weather data is never missing. And then here's our test set. This is the kind of thing that might screw up a model: sometimes the test set is actually missing the information about whether a store was open or not, so that's something to be careful of. So we can take that list of tables and just destructure it out into a whole bunch of separate table names, find out how big the training set is and how big the test set is. With this kind of problem there's going to be a whole bunch of data cleaning and a bunch of feature engineering, and neural nets don't make any of that go away, particularly because we're using this style of neural net where we're basically feeding in a whole bunch of separate continuous and categorical variables. So: simplify things a bit, turn state holidays into booleans, and then I'm going to join all of these tables together. I always use a default join type of a left outer join. You can see here this is how we join in pandas: we say table.merge(table_two), and then to make it a left outer join, how='left', and then you say what the names of the fields are that you're going to join on from the left-hand side, what the fields are that you're going to join on from the right-hand side, and then, if both tables have some fields with the same name, what you're going to suffix those fields with; on the left-hand side we're not going to add any suffix, on the right-hand side we'll put in '_y'. Again, I try to refactor things as much as I can, and since we're going to join lots of things, let's create one function to do the joining so we can call it lots of times.
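That helper ends up looking something like this; it's my reconstruction of what's described rather than necessarily the exact code from the notebook.

```python
def join_df(left, right, left_on, right_on=None):
    """Left outer join of right onto left; clashing column names coming from
    the right-hand table get an '_y' suffix."""
    if right_on is None:
        right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=('', '_y'))

# e.g. attach the store metadata to every training row
joined = join_df(train, store, 'Store')
```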
Someone asked: were there any fields referring to the same value but named differently? Not that I saw, and it wouldn't matter too much if there were, because when we run the model it's not a problem. We also have a question from the forum: would you liken the use of embeddings from a neural network to extraction of implicit features, or can we think of it more like what a PCA would do, dimensionality reduction? Let's talk about it more when we get there, but basically, when you deal with categorical variables in any kind of model you have to decide what to do with them. A pair of my favorite data scientists, who are very nearly neighbors of Rachel and mine, have this fantastic R package called vtreat which has a bunch of state-of-the-art approaches to dealing with stuff like categorical variable encoding. The obvious way to do categorical variable encoding is to just do a one-hot encoding, and that's the way nearly everybody puts it into their gradient boosting machines or random forests or whatever, but one of the things that vtreat does is offer some much more interesting techniques. For example (this is John and Nina's package) you could look at the univariate mean of sales for each day of week, and encode day of week using a continuous variable which represents that mean of sales. But then you have to think about whether you take that mean from the training set or the test set or the validation set, and how you avoid overfitting; there are all kinds of complex statistical subtleties to think about, and vtreat handles all this stuff automatically. So there are a lot of great techniques, but they're kind of complicated, and in the end they tend to make a whole bunch of assumptions about linearity or univariate correlations or whatever, whereas with embeddings we're using SGD to learn how to deal with it, just like we do when we build an NLP model or a collaborative filtering model. We provide some initially random embeddings, and the system learns how movies vary and compare to each other, or users vary, or words vary, or whatever. So to me this is the ultimate pure technique, and of course the other nice thing about embeddings is that we get to pick the dimensionality of the embedding, so we can decide how much complexity and how much learning we're going to put into each of the categorical variables; we'll see how to do that in a moment. Okay, so one complexity was that the weather data uses the name of the state rather than the abbreviation of the state, so we can just go ahead and join weather to states to get the abbreviation. The Google Trends information covers a week, from one date to another, so we can split that apart. And you can see here one of the things that happens in the Google Trends data: one of the states is called 'NI', whereas in the rest of the data it's called 'HB,NI'. So this is a good opportunity to learn about pandas indexing.
Most of the time, you want to use this .ix method. The .ix method is your general indexing method, and it takes two things: a list of rows to select and a list of columns to select. You can use it in pretty standard, intuitive ways; it's a lot like numpy. This expression here is going to return a list of booleans saying which rows are in this state, and if you pass a list of booleans to the pandas row selector, it will return just the rows where the boolean is true. So this is going to return the rows in googletrend where googletrend.State is 'NI'. The second thing we pass in is a list of columns, and in this case we've just got one column. One very important thing to remember: again, just like numpy, you can put this kind of thing on the left-hand side of an equals sign (in computer science we call this an l-value), so you can use it as an l-value. We can take the State field for the rows where it's equal to 'NI' and change their value to 'HB,NI'. This is a very nice, simple technique that you'll use all the time in pandas, both for looking at things and for changing things.
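In code, the pattern is just this (written with .loc, which, as comes up in the questions below, is the non-deprecated form of .ix; the 'trend' column name is an assumption on my part):

```python
# Boolean-mask selection, and assignment to the selection ("l-value" use):
rows = googletrend.State == 'NI'            # boolean Series: which rows are in 'NI'
googletrend.loc[rows, 'State'] = 'HB,NI'    # rewrite just those rows' State field

# the same pattern works for just looking at things:
googletrend.loc[googletrend.State == 'HB,NI', ['State', 'trend']].head()
```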
Excuse me, we have a few questions. One is: in this particular example, do you think the granularity of the data matters, as in per day or per week, is one better than the other? Yeah, I'd want the lower level of granularity so that I can capture that; ideally you'd want time of day as well. It kind of depends how the organization is going to use it, what they're going to do with this information. It's probably for purchasing and stuff, so maybe they don't care about an hourly level, but clearly the difference between Sunday sales and Wednesday sales will be quite significant. So this is mainly a business context or domain understanding question. Another question: do you know if there's any work that compares, for structured data, supervised embeddings like these to embeddings that come from an unsupervised paradigm such as an autoencoder? It seems like you'd get embeddings that are more useful for prediction in the former case, but if you wanted general-purpose embeddings you might prefer the latter. I think you're all aware of my feelings about autoencoders; it's like giving up on life, because you can always come up with a loss function that's more interesting than an autoencoder loss function. I would be very surprised if embeddings that came from a sales model were not more useful for just about everything than something that came from an unsupervised model. These things are easily tested, and if you do find a model they don't work as well with, then you can come up with a different set of supervised embeddings for that model. And then there's also just a note that .ix is deprecated and we should use .loc instead. Okay, I was going to mention that pandas is changing a lot, and because I've been running this course I have not been keeping track of the recent versions of pandas, so thank you; that should be googletrend.loc. In pandas there's a whole page on advanced indexing methods. I don't find the pandas documentation terribly clear, to be honest, but there is a fantastic book by the author of pandas called Python for Data Analysis; there's a new edition out, and it covers pandas, numpy, matplotlib and so forth. That's by far the best way to actually understand pandas, because the documentation is a bit of a nightmare and keeps changing, and the new edition has all the new stuff in it. With these kinds of indexing methods, pandas tries really hard to be intuitive, which means that quite often you'll read the documentation for one of these methods and it will say: if you pass it a boolean it'll behave this way, if you pass it a float it'll behave that way, if it's an index it's this way, unless this other thing happens. I don't find that intuitive at all, because in the end I need to know how something works in order to use it correctly, so you end up having to remember this huge list of behaviors. So I think pandas is great, but this is one thing to be very careful of: really make sure you understand how all these indexing methods actually work. I know Rachel's laughing because she's been there, probably laughing in disgust at what we all have to go through. Another question: when you use embeddings from a supervised model in another model, do you have to worry about data leakage? Yes, you always have to worry about data leakage; I think that's a great point and I don't think I've got anything to add. You can figure out easily enough if there's leakage, but it's definitely something to think about. Okay, so there's a standard set of steps that I take for every single structured machine learning model I do, and one of them is: every time I see a date, I always create four more fields, the year, the month of year, the week of year, and the day of week. This is something which should be automatically built into every data loader, I feel. It's so important, because these are the kinds of structures that you see, and once every single date has got this added to it, you're doing great. So you can see that I add that to all of my tables that have a date field, and we'll have that from now on. So now I go ahead and do all of these outer joins, and you'll see that the first thing I do after every outer join is check whether the thing I just joined with has any nulls. Even if you're sure that these things match perfectly, I would still never do an inner join; do the outer join and then check for nulls, and that way, if anything ever changes or if you ever make a mistake, one of these null counts will not be zero. If this was happening in a production process, this would be an assert; this would be emailing Henry at 2am to say, something you're relying on is not working the way it was meant to, look out. So that's why I always do it this way. You can see I'm just basically joining my training set to everything else, until it's all together in one big table. That table with everything joined together is called joined. And then I do a whole bunch more thinking, or rather I didn't do the thinking, the people that did well in this competition did, and I replicated their results from scratch, about what other things you might want to do with these dates. CompetitionOpenSince: we noticed before that about a third of the time it's empty, so we just fill in the empties with some kind of sentinel value, because a lot of machine learning systems don't like missing values; fill in the missing months with some sentinel value too, and keep on filling in missing data. fillna is a really important thing to be aware of.
Richard asked: isn't filling in the month with one a problem, since one is also a real value? I guess the answer is yes, it is a problem, but in this case I happen to know that every time the year is empty the month is also empty, and we only ever use both of them together, so any model, whether it's tree based or neural net based or whatever, is going to take advantage of that fact. But yeah, it probably would have been safer to pick something else. Okay, so we don't really care, actually, when the competitor's store opened in absolute terms; what we really care about is how long it's been between when it opened and the particular row that we're looking at. For the sales on the second of February 2014, how long was it between the second of February 2014 and when the competition opened? You can see here we use this very important .apply function, which just runs a Python function on every row of a DataFrame, and in this case the function creates a new date from the open-since year and the open-since month, and we just assume it's the middle of the month. That's our CompetitionOpenSince, and then we can get our days open by just doing a subtraction; in pandas, every date field has this special, magical dt property, which is where days, month, year, all that stuff sits. Now, sometimes, as I mentioned, the competition actually opened later than the particular observation we're looking at, so that would give us a negative, so we replace our negatives with zero. And we're going to use an embedding for this, which is why we replace days open with months open, so we have fewer distinct values. I didn't actually try replacing this with a continuous variable; I suspect it wouldn't make too much difference, but this is what they did. And in order to keep the embedding not too big, they replaced anything that was bigger than two years with two years. So there are our unique values. Every time you do something, print something out to make sure that the thing you thought you did is what you actually did. It's much easier in Excel, because you see straight away what you're doing, but in Python this is the kind of stuff where you have to really be rigorous about checking your work at every step; when I build stuff like this, I generally make at least one error in every cell, so check carefully. Then do the same thing for the promo days, and turn those into weeks. Okay, so that's some basic pre-processing; you get the idea of how pandas works, hopefully.
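Here's a rough sketch of those date steps: the date-component fields and the competition-days-open calculation. The column names follow the Kaggle data, but the helper itself and the exact clipping values are my reconstruction, and it assumes the missing years and months have already been filled with a valid sentinel as described above.

```python
import datetime
import pandas as pd

def add_datepart(df, fldname):
    """Add Year / Month / Week / DayOfWeek columns derived from a date column."""
    fld = pd.to_datetime(df[fldname])
    df['Year'], df['Month'] = fld.dt.year, fld.dt.month
    df['Week'] = fld.dt.isocalendar().week     # on older pandas: fld.dt.week
    df['DayOfWeek'] = fld.dt.dayofweek
    return df

joined = add_datepart(joined, 'Date')

# how long has the nearest competitor been open, as of each row's date?
joined['CompetitionOpenSince'] = pd.to_datetime(joined.apply(
    lambda r: datetime.datetime(int(r.CompetitionOpenSinceYear),
                                int(r.CompetitionOpenSinceMonth), 15), axis=1))
days_open = (pd.to_datetime(joined.Date) - joined.CompetitionOpenSince).dt.days
joined['CompetitionDaysOpen'] = days_open.clip(lower=0)
joined['CompetitionMonthsOpen'] = (joined.CompetitionDaysOpen // 30).clip(upper=24)
```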
So the next thing they did in the paper was, again, some very common kind of time series feature manipulation to be aware of. They basically wanted to say: every time there's a promotion or a holiday, I want to create some additional fields for every one of our training set rows, each of which is on a particular date, saying, as of that date, how long is it until the next holiday, how long since the previous holiday, how long until the next promotion, how long since the previous promotion. So we create those fields, because this is the kind of thing that's super difficult for any GBM or random forest or neural net to figure out how to calculate itself; there's no obvious mathematical function that it's going to build on its own. This is the kind of feature engineering that we have to do in order to use these kinds of techniques effectively on time series data. A lot of people who work with time series data, particularly in academia outside of industry, just aren't aware that the state-of-the-art approaches really involve all of these steps: separating out your dates into their components, turning everything you can into durations, both forwards and backwards, and also, as you'll see in a moment, doing a bunch of running averages. When I used to do a lot of this kind of work, I had a bunch of library functions that I would run on every file that came in, and they would automatically do these things for every combination of dates. Now, this thing of how long until the next promotion and how long since the previous promotion is not easy to do in pretty much any database system, or indeed in pandas, because generally speaking these kinds of systems are looking for relationships between tables, but we're trying to look at relationships between rows. So I had to create this tiny, simple little class to do it. Basically what happens is, let's say I'm looking at school holidays: I sort my DataFrame by store and then by date, and I call this little function add_elapsed with SchoolHoliday and 'After'. What does add_elapsed do? It creates an instance of this class called Elapsed, in this case constructed with SchoolHoliday. Then we call the apply function again (remember, it runs on every single row), and it calls my Elapsed object's get method for every row. So I'm going to go through every row, in order of store and then date, trying to find how long it has been since the last school holiday. When I create this object, I just have to keep track of which field it is, okay, SchoolHoliday, and initialize when was the last time we saw a school holiday, and the answer is we haven't, so initialize it to not-a-number. We also have to handle crossing over to a new store: when we cross over to a new store, we basically have to re-initialize, so we also keep track of the last store we saw, initialized to zero. Every time get is called, we check: have we crossed over to a new store? If so, just initialize both of those things back again. Then we just say: is this row a school holiday? If so, then the last time we saw a school holiday is today. And finally, return how long it is between today and the last time we saw a school holiday. So this class is basically a way of keeping some memory about when I last saw this observation, and then by just calling df.apply, it keeps track of this for every single row. Then I can call that for school holiday, after and before, the only difference being that for before I just sort my dates in descending order, and likewise for state holiday and promo. So in the end that's going to add six fields: how long until, and how long since, the last school holiday, state holiday and promotion.
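Here's my reconstruction of that idea; it is not the competitors' exact code. It assumes Date has already been parsed to datetimes, and in practice you'd then convert the resulting timedeltas to days and fill in the NaT values afterwards.

```python
import pandas as pd

class Elapsed:
    """Track, per store, how long it has been since a boolean field was last true."""
    def __init__(self, fld):
        self.fld = fld
        self.last = pd.NaT        # date of the last event seen
        self.last_store = 0       # so we can reset when we cross into a new store

    def get(self, row):
        if row.Store != self.last_store:   # new store: forget everything
            self.last = pd.NaT
            self.last_store = row.Store
        if row[self.fld]:                  # an event on this row: remember its date
            self.last = row.Date
        return row.Date - self.last        # NaT at the start of each store's history

def add_elapsed(df, fld, prefix):
    elapsed = Elapsed(fld)
    df[prefix + fld] = df.apply(elapsed.get, axis=1)

df = df.sort_values(['Store', 'Date'])
add_elapsed(df, 'SchoolHoliday', 'After')                    # days since the last one
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
add_elapsed(df, 'SchoolHoliday', 'Before')                   # days until the next one
```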
Then there were a couple of questions. One asked: is this similar to a windowing function? Not quite; we're about to do a windowing function. And: is there a reason to think the current approach would be problematic with sparse data? I don't see why, but I'm not sure I quite follow; you've got one next door. So: we don't care about absolute days, we care about two things. We do care about the dates, but we've already pulled out the parts we care about, like what year it is and what week it is, and we also care about the elapsed time between the date we're predicting sales for and the previous and next occurrence of various events. And windowing functions we'll do in a moment. Yes: for the features that are time until an event, how do you deal with the fact that you might not know when the next event is within the data? Well, all I do is sort descending and initialize last with not-a-number, so when we then go and do the subtraction, it tries to subtract not-a-number and we end up with a null. So anything that's an unknown time, because it's at one end or the other, is going to end up null, which is why, maybe it's later on but shortly, we're going to replace those nulls with zeros. Pandas has this slightly strange way of thinking about indexes, but once you get used to it it's fine: at any point you can call DataFrame.set_index and pass in a field. You then have to remember which field you have as the index, because quite a few methods in pandas use the currently active index by default, and of course things run faster when you do stuff with the currently active index. You can also pass in multiple fields, in which case you end up with a multi-key index. So the next thing we do is these windowing functions. For a windowing function in pandas we can use rolling; this gives a rolling mean, rolling min, rolling max, whatever you like. This basically says: let's take our DataFrame with the columns we're interested in, school holiday, state holiday and promo, and keep track of how many holidays there are in the next week and the previous week, and how many promos in the next week and the previous week. To do that we can sort, here we are, by date, group by store, and then rolling will be applied to each group, so within each group we create a rolling seven-day sum. It's the kind of notation I'm never likely to remember, but you just look it up, and this is how you do group-by-type stuff. Pandas actually has quite a lot of time series functions, and this rolling function is one of the most useful ones. Wes McKinney, the author, had a background as a quant, if memory serves correctly, and quants love their time series functions; I think that was a lot of the history of pandas. So if you're interested in time series stuff, you'll find a lot of time series functionality in pandas.
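A rough sketch of that windowing step: per-store rolling seven-day sums of the event columns, run once forwards and once backwards. The column names are as in the Kaggle data, and the details of merging the results back on are omitted here.

```python
columns = ['SchoolHoliday', 'StateHoliday', 'Promo']   # StateHoliday already made boolean

df = df.sort_values(['Store', 'Date'])
bwd = df.groupby('Store')[columns].rolling(7, min_periods=1).sum()   # previous week

df = df.sort_values(['Store', 'Date'], ascending=[True, False])
fwd = df.groupby('Store')[columns].rolling(7, min_periods=1).sum()   # following week
```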
One helpful parameter that sits inside a lot of methods is inplace=True. That means that, rather than returning a new DataFrame with the change made, it changes the DataFrame you already have, and when your DataFrames are quite big this is going to save a lot of time and memory; it's a good little trick to know about. So now we merge all of these together, and we can see that we've got all these AfterSchoolHoliday and BeforeSchoolHoliday fields and our backward and forward running sums, and then we join that up to our original DataFrame, and here we have our final result. We started out with a pretty small set of fields in the training set, but we've done this feature engineering, and this feature engineering is not arbitrary. Although I didn't create this solution, I was just re-implementing the solution that came from the competition's third place getters, this is nearly exactly the set of feature engineering steps I would have done; it's just a really standard way of thinking about a time series, so you can definitely borrow these ideas pretty closely. So now that we've got this table and we've done our feature engineering, we want to feed it into a neural network. To do that, we have to do a few things: the categorical variables have to be turned into one-hot encoded variables, or at least into contiguous integers, and the continuous variables we probably want to normalize to a zero mean and one standard deviation. There's a very little-known package called sklearn_pandas, and I actually contributed some new stuff to it for this course to make this even easier to use. If you use the DataFrameMapper from sklearn_pandas, as you'll see, it makes life very easy; without it, life is very hard, and because very few people know about it, the vast majority of code you'll find on the internet makes life look very hard. So use this code, not the other code. Actually, I was talking to some of the students the other day and they were saying that for their project they were stealing lots of code from part one of the course, because they just couldn't find anywhere else people writing these kinds of code; the stuff we've learnt throughout this course is, on the whole, not code that lives elsewhere very much at all. So feel free to use a lot of these functions in your own work, because I've really tried to make them the best version of that function. One way to do the embeddings, and the way they did it in the paper, is basically to say: for each categorical variable, they just manually decided what embedding dimensionality to use. They don't say in the paper how they picked these dimensionalities, but generally speaking things with a larger number of separate levels tend to have more dimensions. I think there are about a thousand stores, so that has a big embedding dimensionality, whereas obviously things like promo forward and backward, or day of week, have much smaller ones. So this is a dictionary I created that basically goes from the name of the field to the embedding dimensionality; again, this is all code that you can use in your own models. Then all I do is say: my categorical variables are, okay, go through my dictionary, sort it in reverse order of the value, and take the first element of each pair, which just gives me the keys in reverse order of dimensionality. The continuous variables are just a list. Then make sure there are no nulls: for continuous variables, replace nulls with zeros; for categorical variables, replace nulls with empty strings. And then here's where we use the DataFrameMapper. A DataFrameMapper takes a list of tuples, each with just two items: the first item is the name of the variable, and in this case I'm looping through each categorical variable name; the second thing in the tuple is a class, or actually an instance of a class, which is going to do your preprocessing. There are really just two that you're going to use almost all the time. For the categorical variables, sklearn comes with something called LabelEncoder.
It's really badly documented, in fact misleadingly documented, but it's exactly what you want: something that takes a column, figures out all the unique values that appear in that column, and replaces them with a set of contiguous integers. So if you've got the days of the week, Monday through Sunday, it'll replace them with zero through six. And then, very importantly, and this is critically important, you need to make sure that the training set and the test set use the same codes; there's no point having Sunday be zero in the training set and one in the test set. Because we're actually instantiating this class here, this object is going to keep track of which codes it's using. Ditto for the continuous variables, which we want to normalize to zero mean and one standard deviation; again, we need to remember what mean we subtracted and what standard deviation we divided by, so that we can do exactly the same thing to the test set, otherwise our models are going to be nearly totally useless. The way the DataFrameMapper works is that, using these instantiated objects, it keeps track of that information. So this is basically code you can copy and paste into every one of your models. Once we've got those mappings, you just pass them to a DataFrameMapper and then call .fit, passing in your data set, and this thing is now a special object which has a .features property that's going to contain all of the pre-processed features you want. So the categorical columns contain the result of doing this mapping, basically doing this label encoding. In some ways the details of how this works don't matter too much, because you can just use exactly this code in every one of your models. Same for the continuous variables; it's exactly the same code, except of course it uses StandardScaler, which is the scikit-learn thing that turns it into a zero mean, one standard deviation variable, so the continuous columns have all been standardized. Here's an example of the first five rows from the zeroth column for a categorical variable, and ditto for a continuous one; you can see these have been turned into integers, and these have been turned into numbers which will average to zero and have a standard deviation of one. One of the nice things about this DataFrameMapper is that you can now take that object and store it, pickle it, and so you can use those categorical encodings and scaling parameters elsewhere just by unpickling it; you immediately get those same parameters back. For my categorical variables, you can see here the number of unique classes in every one: here are my 1,100 stores, and 31 days of the month, and seven days of the week, and so forth. All right, so that's the key preprocessing that has to be done.
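Put together, the mapper step looks roughly like this; the variable lists here are just a small illustrative subset, not the full set from the notebook.

```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder, StandardScaler

cat_vars = ['Store', 'DayOfWeek', 'StateHoliday']            # illustrative subset
contin_vars = ['CompetitionDistance', 'Max_TemperatureC']    # illustrative subset

cat_maps    = [(c, LabelEncoder()) for c in cat_vars]          # 1-d input per column
contin_maps = [([c], StandardScaler()) for c in contin_vars]   # 2-d input per column

cat_mapper    = DataFrameMapper(cat_maps).fit(joined)
contin_mapper = DataFrameMapper(contin_maps).fit(joined)

cat_cols    = cat_mapper.transform(joined).astype('int64')
contin_cols = contin_mapper.transform(joined).astype('float32')

# the fitted mappers remember the label codes / means / stds -- pickle them so the
# test set can be transformed with exactly the same parameters later
```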
So here is their big mistake, and I think if they hadn't made it they probably would have won. Their big mistake is that they filtered to joined.Sales not equal to zero: they removed all of the rows with no sales, and those are all the rows where the store was closed. Why was this a big mistake? Because if you go to the Rossmann Store Sales competition website, click on Kernels, and look at the kernel that got the highest rating (oh, this is not the right one, sorry, this one, the exploratory data analysis of Rossmann), I'll show you a couple of pictures. Here is an example of a store, store 708, and these are all from that kernel. Here is a period of time where it was closed for refurbishment. This happens a lot in Rossmann stores: you get these periods where you get lots of zero-sales days in a row. Look what happens immediately before and after. In the data set that we're looking at, our unfortunate third place winners deleted all of those rows, so they had no ability to build a feature that could find this. So that's store 708; look, here's another one where it was closed, same thing. This turns out to be super common, and the second place winner actually built a feature, and it's going to be exactly the same kind of feature we've seen before: how many days since the last closure and how many days until the next closure. If our third place winners had just done that, I'm pretty sure they would have won. So that was their big mistake. This kernel has a number of other interesting analyses in it. Here's another one, which I think our neural net can capture, although it might have been better to be explicit: some stores opened on Sundays; most didn't, but some did, and for those stores that opened on Sundays, their sales on Sundays were far higher than on any other day, I guess because in Germany not many shops open on Sundays. So something else they didn't explicitly do was create an is-this-store-open-on-Sundays field. Having said that, I think the neural net may have been able to put that into the embedding, so if you're interested, during the week you could try adding this field and see if it actually improves things; I'd certainly be interested to hear whether, if you try adding this field, you find you actually would have won the competition. Because of this Sunday thing (these are all from the same Kaggle kernel), here's the day of week, and here are the sales as a box plot, and you can see that normally on a Sunday it's not that the sales are much higher, so it really is specific to those particular stores. So that's the kind of visualization which is really helpful to do as you work through these kinds of problems: just draw lots of pictures. Those pictures were drawn in R, and R is actually pretty good for this kind of structured data. Then a question: so for categorical fields, they're converted to numbers, not normalized to mean zero, right? Right, they were just, you know, Monday is zero, Tuesday is one, whatever. And as is, they'll be sent to the neural network? We're going to get there; we're going to use embeddings, just like we did with word embeddings. Remember, we turned every word into a word index, so our sentences, rather than being "the dog ate the beans", would be three, six, twelve, two, whatever. We're going to do the same basic thing; we've done the same basic thing. Okay, so now that we've made their terrible mistake ourselves, we've still got 824,000 rows left. As per usual, I made it really easy for myself to create a random sample, and did most of my analysis with a random sample, but you can just as easily not use the random sample; so now I've got a separate sample version of it. Okay, split it into training and test, and notice here that the way I split it into training and test is not random, and the reason it's not random is because in the Kaggle competition they set it up the smart way.
The smart way to set up a test set in a time series is to make your test set the most recent period of time. If you choose random points you've got two problems: first, you're predicting tomorrow's sales in a situation where you always have the previous day's sales, which is very rarely the way things really work; and secondly, you're ignoring the fact that in the real world you're always trying to model a few days or a few weeks or a few months into the future that haven't happened yet. So if you were setting up the data for such a model yourself, you would need to decide: how often am I going to rerun this model, how long will it take for those model results to get into the field and be used, and however they're being used. In this case I can't remember exactly, I think it's a month or two, so I should make sure there's a month-or-two test set which is the last chunk of time. You can see here I've taken the last 10 percent as my validation set, and it's literally just: here's the first bit, and here's the last bit, and since it was already sorted by date, this ensures I've done it the way I want. I just wanted to point out the time, so you should probably take a break. We sure will. Okay, so this is how you take that DataFrameMapper object we created earlier; remember, we called .fit in order to learn the transformation parameters, and you then call .transform to actually apply them. So take my training set, transform it, and grab the categorical variables; and then the continuous pre-processing is the same thing with my continuous mapper: pre-process my training set and grab my continuous variables. So that's nearly done. The only final piece is that, in their solution, they modified their target, their sales value, and the way they modified it was that they found the highest amount of sales, took the log of that, and then modified all of their y values to be the log of sales divided by the maximum of the log of sales. What this means is that the y values are going to be no higher than one. Furthermore, remember how sales had a long tail; the average was 5,000 but the maximum was 40-something thousand. This is really common: most financial data, sales data and so forth generally has a nicer shape when it's logged than when it's not, so taking the log is a really good idea. The reason that, as well as taking the log, they also did this division, is that it means we can now use a sigmoid activation function in our neural net, which goes between nought and one, and then just multiply by the maximum log, and that basically ensures the data is in the right scaling range. I actually tried taking this out, and this technique doesn't really seem to help. It reminds me of the style transfer paper, where they mentioned they originally had a hyperbolic tan layer at the end for exactly the same reason, to make sure everything was between nought and 255, and it turned out that if you just use a linear activation it works just as well. So, interestingly, this idea of using sigmoids at the end in order to get the right range doesn't seem to be that helpful. My guess is the reason why is that with a sigmoid it's really difficult to get the maximum, and I think what they actually should have done is, instead of using the maximum, use something like the maximum times 1.25, so that they never have to predict one, because it's impossible for a sigmoid to output one.
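To make the target scaling concrete, it's just this (a sketch; it assumes the zero-sales rows have already been dropped, so the log is defined):

```python
import numpy as np

max_log_y = np.max(np.log(train.Sales))      # per the caveat above, 1.25x this might be safer
y_train = np.log(train.Sales) / max_log_y    # targets now lie in (0, 1]

# at prediction time, invert it: predicted_sales = np.exp(model_output * max_log_y)
```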
Someone asked: is there any issue in fitting the pre-processors on the full training and validation data? Shouldn't they be fit only to the training set? No, it's fine. In fact, for the categorical variables, if you don't include the test set then you're going to have some codes that aren't there at all, whereas this way they'll just be random, which is better than failing. As for deciding what to divide and subtract by in order to get a zero mean, unit standard deviation variable, it doesn't really matter; there's no leakage involved, which is what you'd be worried about. Okay: root mean squared percentage error is what the Kaggle competition used as the official loss function, so this is just calculating that. So before we take our break, we'll finally take a look at the definition of the model. I'll kind of work backwards. Here's the basic model: get our embeddings, combine the embeddings with the continuous variables, a tiny bit of dropout, one dense layer, a second dense layer, more dropout, and then the final sigmoid activation function. You'll see that I've got commented-out stuff all over the place, and this is because I had a lot of questions (we're going to cover this up to the break) about some of the details of why they did things certain ways; some of the things they did were so weird I thought they couldn't possibly be right, so I did some experimenting, and we'll learn more about that in a moment. For the embeddings, as per usual I created a little function to create an embedding, which first of all creates my regular Keras input layer and then creates my embedding layer. How many embedding dimensions am I going to use? Sometimes I looked them up in that dictionary I had earlier, and sometimes I calculated them using this simple approach of saying: use however many levels there are in the categorical variable, divided by two, with a maximum of 50. These were two different techniques I was playing with. Normally with word embeddings you have a whole sentence, and you've got to feed it to an RNN, so you have time steps; normally you have an input length equal to the length of your sentence, which is the number of time steps for the RNN. We don't have an RNN and we don't have any time steps, we just have one element in one column, so I have to add Flatten after this, because otherwise it's going to have this redundant length-one time axis that I don't want. This is just because people don't normally do this kind of stuff with embeddings, so the library assumes you're going to want it in a format ready to go to an RNN; this just turns it back into a normal format. So we grab each embedding, we end up with a whole list of them, we then combine all of those embeddings with all of our continuous variables into a single list of inputs, and so our model is going to have all of those embedding inputs and all of our continuous inputs, and then we can compile it and train it. All right, so let's take a break, and see you back here at five past eight.
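Here's a compact sketch of a model along those lines in the Keras functional API; the variable lists, embedding sizes and layer sizes here are illustrative rather than the exact values from the paper or the notebook.

```python
from keras.layers import Input, Embedding, Flatten, Dense, Dropout, concatenate
from keras.models import Model

def emb_input(name, n_levels):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_levels, min(50, (n_levels + 1) // 2))(inp)
    return inp, Flatten()(emb)            # drop the redundant length-one time axis

cat_sizes = {'store': 1100, 'day_of_week': 7, 'month': 12}   # illustrative
n_contin = 16                                                # illustrative

embs = [emb_input(name, sz) for name, sz in cat_sizes.items()]
contin_inp = Input(shape=(n_contin,), name='contin')

x = concatenate([emb for _, emb in embs] + [contin_inp])
x = Dropout(0.02)(x)
x = Dense(1000, activation='relu')(x)
x = Dense(500, activation='relu')(x)
x = Dropout(0.2)(x)
out = Dense(1, activation='sigmoid')(x)    # rescaled by max_log_y at prediction time

model = Model([inp for inp, _ in embs] + [contin_inp], out)
model.compile(optimizer='adam', loss='mean_absolute_error')
```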
All right, let's take a break — see you back here at five past eight. OK, so we've got our neural net set up; we train it in the usual way — call .fit — and away we go. That's basically it, and it trains reasonably quickly: six minutes in this case. Sure — we got two questions that came in, I think right as we broke. One of them: for the normalization, is it possible to use a function other than log, such as sigmoid? I don't think you'd want to use sigmoid; financial data and sales data and so on tends to be of a shape where log will make it more linear, which is generally what you want. And: when we log-transform our target variable we're also transforming the squared error — is this a problem, or is it helping the model find a better minimum in the untransformed space? So you've got to be careful about what loss function you want. In this case the Kaggle competition is trying to minimize root mean squared percentage error, so I actually said: OK, use mean absolute error — because in log space that's basically doing the same thing. The percentage is a ratio, and the absolute error between two logs is basically the same as a ratio. So you need to make sure your loss function is appropriate in that space; I think this is one of the things they didn't do in the original competition, and as you can see I tried changing it and I think it helped. By the way, XGBoost is fantastic. Here is the same series of steps to run this model with XGBoost: as you can see, I just concatenate my categorical and continuous variables for the training set and for the validation set. Here is a set of parameters that tends to work pretty well. XGBoost has a data type called DMatrix — a data matrix, basically a normal matrix except that it keeps track of the names of the features, so it prints out better information. Then you call .train, which takes less than a second to run, and it's not massively worse than our previous result — so this is a good way to get started. The reason XGBoost and random forests are particularly helpful is that they give you something called variable importance. This is how you get the importance for an XGBoost model: it takes a second, and suddenly here is the information you need. When I was having trouble replicating the original results from the third-place winners, one of the things that helped me a lot was to look at this feature importance plot and say: OK, competition distance — holy cow, that's really, really important — let's make sure my competition-distance pre-processing really is exactly the same. On the other hand, 'events' doesn't really matter at all, so I'm not going to worry much about checking my events. This feature importance plot — or variable importance plot, as it's also known — can also be created with a random forest. These are amazing: because you're using a tree ensemble, it doesn't matter what shape anything is, and it doesn't matter whether or not you have interactions — it's basically assumption-free. In real life this is the first thing I do: get a feature importance plot printed, because it often turns out there are only three or four variables that matter, and you might have 10,000 variables.
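For reference, the whole XGBoost path is only a handful of lines — roughly this sketch, where cat_train, contin_train, y_train and feature_names are assumed to come from the earlier pre-processing, and the parameter values are illustrative rather than the exact ones from the notebook:

```python
import numpy as np
import xgboost as xgb

# Categorical codes and continuous features concatenated column-wise
x_train = np.concatenate([cat_train, contin_train], axis=1)
x_valid = np.concatenate([cat_valid, contin_valid], axis=1)

# DMatrix keeps track of feature names, so the importance plot is readable
dtrain = xgb.DMatrix(x_train, label=y_train, feature_names=feature_names)
dvalid = xgb.DMatrix(x_valid, label=y_valid, feature_names=feature_names)

params = {'objective': 'reg:linear', 'eta': 0.1, 'max_depth': 6,
          'subsample': 0.8, 'colsample_bytree': 0.8}
model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dvalid, 'valid')])

# The variable-importance plot that makes it easy to see what actually matters
xgb.plot_importance(model)
```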
I worked on a big credit scoring problem a couple of years ago where I had nine and a half thousand variables, and it turned out that only nine of them mattered. The company I was working for had literally spent something like five million dollars on a big management consulting project, and that project had told them all these ways they could capture this information in a really clean way for their credit scoring models — and of course none of those things were in the nine that mattered. So they could have saved five million dollars, but they didn't, because management consulting companies don't use random forests. So I can't overstate the importance of this plot, but this is a deep learning course, so we're not going to spend more time on it. Now, I mentioned there were a whole bunch of really, really weird things in the way the competition's third-place getters did things. For one, they didn't normalize their continuous variables — who does that? But when people do well in a competition, something's working. The way they initialized their embeddings was really weird too; there were all these things that were really weird. So what I did was write a little script, 'Rossmann experiments'. Basically I copied and pasted all the important code out of my notebook — and remember I'd already pickled the parameters for the label encoder and the scaler, so I didn't have to worry about doing those again. Once I'd copied all that code in (exactly the code you just saw), I had this bunch of for loops — pretty inelegant — covering all the things I wanted to find out: does it matter whether or not you use zero-one scaling? Does it matter whether you use their weird approach to initializing embeddings? Does it matter whether you use their particular dictionary of embedding dimensions or my simple little formula? Something else I tried: they took each continuous variable and put it through its own separate little dense layer, and I thought, why don't we put them all together? And let's try some other things, like using batch normalization. So I ran this and got back every possible combination of these. This is where you want to be using a script — and I'm not going to tell you I jumped straight to this. First I spent days screwing around running experiments in a notebook by hand, continually forgetting what I had just done, until eventually it took me about an hour to write this script.
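The experiment script itself is nothing fancy — conceptually it's just nested loops over the design choices, roughly like this sketch (the flag names and the train_model helper are stand-ins for the copied-and-pasted notebook code):

```python
import csv
import itertools

# Each flag is one of the design decisions I wanted to test
options = {
    'normalize_contins': [True, False],   # zero-one scale continuous variables?
    'use_my_emb_dims':   [True, False],   # my simple formula vs their dictionary
    'single_dense':      [True, False],   # one dense matrix vs per-variable layers
    'careful_init':      [True, False],   # a sensible init vs their odd one
    'batchnorm':         [True, False],
}

with open('rossmann_experiments.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(list(options) + ['best_val_loss'])
    for combo in itertools.product(*options.values()):
        cfg = dict(zip(options, combo))
        # train_model is a stand-in for the notebook code: build, fit, return history
        history = train_model(**cfg)
        writer.writerow(list(combo) + [min(history['val_loss'])])
```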
Then of course I pasted the results into Excel, chucked them into a pivot table, used conditional formatting, and here they are: all my different combinations — with and without normalization, my simple function versus their dictionary, a single dense matrix versus putting everything through separate layers, my init versus their lack of an init. This dark blue here is what they did, and to me it's full of weirdness — but as you can see it actually is the darkest blue; it actually is the best. When you zoom out, though, you realize there's a whole corner over here with a couple of 0.086s that's nearly as good, but seems much more consistent — and also more consistent with sanity: yes, do normalize your data, and yes, do use an appropriate initialization, and if you do those two things it doesn't really matter much what else you do, it's all going to work fine. What I then did was create a little sparkline in Excel for the actual training curves. Here's their winning one again, 0.085, but here's the variance in getting there, and as you can see their approach was pretty bumpy — up and down, up and down. The second best, on the other hand, 0.086 rather than 0.085, goes down very smoothly. So given that it sits in this very stable part of the space, and given it trains much more smoothly, I actually think their number is just random chance — it just happened to be low at that point — and I actually think this is the better approach: it's more sensible and more consistent. I wanted to show you this kind of approach to running experiments to say: when you run experiments, try to do it in a rigorous way, and track both the stability of the approach and the final result. This combination here makes so much sense — use my simple function rather than their weird dictionary, use normalization, use a single dense matrix, and use a thoughtful initialization — and if you do all of those things you end up with something that's basically as good and much more stable. OK, that's what I wanted to say about Rossmann. I'm going to very briefly mention another competition, the Kaggle taxi destination competition. Question over here: you were saying you did a couple of experiments — one where you figured out the embeddings and then put the embeddings into a random forest, and then put the embeddings into a neural network again? Oh, I didn't do that — that was from the paper. So I don't understand, because you just used one neural network to do everything together? Right — what they did, for this result here (the 0.115), was train the neural network I just showed you, then throw the neural network away and train a GBM; but for the categorical variables, rather than using one-hot encodings, they used the learned embeddings. That's all.
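Just to be clear about the mechanics of that: you pull the learned embedding weights out of the trained net and use them as features for the GBM instead of one-hot codes. A rough sketch — not their actual code — assuming the Keras embedding layers were given names and reusing the earlier xgboost setup:

```python
import numpy as np
import xgboost as xgb

def embed_column(model, layer_name, codes):
    # get_weights()[0] has shape (n_levels, emb_dims); index it by the category codes
    emb = model.get_layer(layer_name).get_weights()[0]
    return emb[codes]

# Hypothetical layer names; replace with whatever the trained net actually uses
store_feats = embed_column(nn_model, 'store_emb', cat_train[:, 0])
dow_feats   = embed_column(nn_model, 'dow_emb',   cat_train[:, 1])

# Replace the raw category codes with their embedding vectors; keep continuous as-is
x_train = np.concatenate([store_feats, dow_feats, contin_train], axis=1)
gbm = xgb.train({'objective': 'reg:linear', 'eta': 0.1, 'max_depth': 6},
                xgb.DMatrix(x_train, label=y_train), num_boost_round=200)
```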
OK, so the taxi competition was won by the team with this unicode name, which is pretty cool, and it actually turned out to be a team run by Yoshua Bengio, who is one of the people who stuck it out through the AI winter and is now one of the leading lights in deep learning. Interestingly, the thing I just showed you — in the paper they wrote on the Rossmann competition, they claimed to have invented this idea of categorical embeddings, but Yoshua Bengio's team had actually won this competition a year earlier with the same technique. Again, it's so uncool that nobody noticed, even though it's Yoshua Bengio. So I want to quickly show you what they did. This is the paper they wrote, and their approach to picking an embedding size was very simple: use ten. The data was: which customer is taking this taxi, which taxi are they in, which taxi stand did they get the taxi from, and then quarter-hour of the day, day of the week and week of the year — they didn't add all kinds of other stuff; this is basically it. So then they said: OK, we're going to learn embeddings, inspired by NLP. To my knowledge this is actually the first time this appears in the literature — having said that, I'm sure plenty of people tried it before; they just never wrote a paper about it. Question: as a quick sanity check, if you have day of the week — seven levels, so at most seven one-hot variables — an embedding size of ten doesn't make any sense, right? So I used to think that, but actually it can work. In the last few months I've quite often ended up with embeddings bigger than the original cardinality, and often it does give better results. I think it's just that, once you realize an embedding is just a dense layer on top of a one-hot encoding, there's no reason the dense layer shouldn't carry more information. I found it weird too — I still find it a little weird — but it definitely seems to be quite useful. (Sorry, we lost that — does it hurt or does it help? It helps: I've found plenty of cases now where I need a bigger embedding dimensionality than the cardinality of my categorical variable.) OK, now this competition is really a time series competition again, because the main thing you're given, other than all this metadata, is a series of GPS points — every GPS point along a route — and at some point, for the test set, the route is cut off and you have to figure out what the final GPS point would have been: where are they going? Here's the model they won with, and it turns out again to be very simple. You take all the metadata we just saw and chuck it through the embeddings; you then take the first five GPS points and the last five GPS points and concatenate them together with the embeddings, chuck them through a hidden layer and then through a softmax. And now this is quite interesting: what they then do is take the result of this softmax and combine it with clusters. What are these clusters? They used mean shift clustering to figure out where people tend to go — with taxis, people tend to go to the airport, or the hospital, or the shopping street. Using mean shift clustering they came up with, I think, about 3,000 clusters: x,y coordinates of places people tended to go. However, people don't always go to those 3,000 places, so here's the really cool thing about using a softmax: they took the softmax output and multiplied it by the cluster centers — a weighted average, using the softmax as the weights and the cluster centers as the things being averaged. In other words, if they're clearly going to the airport, the softmax will end up giving a probability very close to one for the airport cluster; on the other hand, if it's not really clear whether they're going to this shopping strip or this movie theater, those two cluster centers could both get a softmax weight of about 0.5, and it will end up predicting somewhere halfway between the two. So they've built a different kind of architecture from anything we've seen before, where the softmax is not the last thing you do — it's being used to average a bunch of cluster centers. This is really smart, because the softmax makes it easy for the model to pick a specific, very common destination, but it also makes it possible to predict any destination anywhere by combining a weighted average of several cluster centers. I think this is really elegant architecture engineering.
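Here's a rough Keras sketch of that output trick — a softmax over mean-shift cluster centres, then a weighted average of the centres as the predicted destination. All the sizes and names are illustrative, and the centres here are a random stand-in for the real mean-shift output:

```python
import numpy as np
import keras.backend as K
from keras.layers import Input, Dense, Lambda, concatenate
from keras.models import Model

n_clusters = 3000
# (n_clusters, 2) lat/lon cluster centres from mean-shift; random stand-in here
centers = K.constant(np.random.randn(n_clusters, 2))

# Inputs: flattened metadata embeddings, plus first-5 and last-5 GPS points (10 x 2 coords)
meta_in = Input(shape=(50,), name='meta')
gps_in  = Input(shape=(20,), name='gps_prefix_suffix')

x = concatenate([meta_in, gps_in])
h = Dense(500, activation='relu')(x)
p = Dense(n_clusters, activation='softmax')(h)          # one weight per cluster

# Predicted destination = softmax-weighted average of the cluster centres
dest = Lambda(lambda probs: K.dot(probs, centers))(p)   # (batch, 2)

model = Model([meta_in, gps_in], dest)
# The competition metric was a haversine-style distance; mse is just a placeholder
model.compile(optimizer='adam', loss='mse')
```

One consequence of this design is that predictions always land inside the convex hull of the cluster centres, which is exactly the "easy to pick a common destination, still possible to land in between" behaviour described above.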
The trajectory prefix — those last few dots — that's the last five GPS points; but isn't that what we're trying to predict? No, it's the last five GPS points that we're given. To create the training set, they took all of the routes and truncated them randomly: think of the data generator — every time it sampled another route, it would basically slice it off at a random point, and the last five points of that truncated prefix are what we have access to, along with the first five points. The reason it's not all of the points is that they're using a standard multi-layer perceptron here, so the input has to be a fixed length — and you also don't want it to be too big. There's a question: so the prefix is not fed into an RNN, it's just fed into a dense layer? Correct — we just get ten points concatenated together into a dense layer. Surprisingly simple. How good was it? Well, look at the results: 2.14, 2.14, 2.13, 2.13, 2.13, 2.13, 2.11 — or really 2.12 — everybody's clustered together; one team is a bit better at 2.08, and this team is way better at 2.03. And they mention in the paper that they didn't actually have time to finish training; when they did finish training, it was actually 1.87. So they won so easily it's not funny — and interestingly, they mention in the paper that the test set was so small that the only way they could be sure of winning was to win easily. Because the test set was so small, the leaderboard isn't statistically all that meaningful, so they created a custom test set and tried to see whether they could find something even better on it — and it turns out an RNN is better still. It still would have won the competition, but there isn't enough data in the Kaggle test set for that to be a statistically significant result there; on their custom test set it is. A regular RNN wasn't better; what they did instead was use an RNN where they pass in five points at a time. I think what would probably have been even better would be a convolutional layer first, and then pass that into an RNN — they didn't try that, as far as I can see from the paper. And, importantly, a bidirectional RNN, which ensures that the initial points and the final points tend to get more weight, because we know that an RNN's state generally reflects the things it has seen most recently.
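Going back to the data side for a second, the random-truncation sampling used to build the training set could look roughly like this — a sketch, assuming routes is a list of (n_points, 2) arrays of GPS coordinates, each longer than ten points:

```python
import numpy as np

def sample_prefix(route, k=5, rng=np.random):
    """Randomly truncate one route; return (first k + last k points of the prefix, true destination)."""
    n = len(route)
    cut = rng.randint(2 * k, n)          # keep at least 2k points before the cut
    prefix = route[:cut]
    x = np.concatenate([prefix[:k].ravel(), prefix[-k:].ravel()])  # 10 points -> 20 numbers
    y = route[-1]                        # the real final destination of the full route
    return x, y

def batch_generator(routes, batch_size=64):
    while True:
        idxs = np.random.randint(0, len(routes), batch_size)
        xs, ys = zip(*(sample_prefix(routes[i]) for i in idxs))
        yield np.stack(xs), np.stack(ys)
```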
Our poor, long-suffering intern Brad has been trying to replicate this result; he's barely slept over the last couple of weeks and hasn't quite managed it yet, so I'm not going to show you the code — but hopefully once Brad starts sleeping again he'll be able to finish it off, and we can post the notebook that re-implements this on the forum during the week. It's been an interesting process to watch Brad try to replicate it, because in my experience, the vast majority of the time when people say they've tried a model, it didn't work out and they've given up on it, it turns out it's actually because they screwed something up — not because of a problem with the model. And if you weren't comparing to Yoshua Bengio's team's result, knowing that you haven't replicated it yet, at what point do you give up and say "my model's not working" versus saying "no, I've still got bugs"? It's very difficult to debug machine learning models. What Brad has actually ended up doing is literally taking the original Bengio team code, running it line by line, and then trying to replicate it in Keras line by line — literally np.allclose at every step — because a model like this doesn't look that complex, but there are just so many places you can make little mistakes. No normal person makes zero mistakes; normal people like me make dozens. So when you build a model like this you need to find a way to test every single line of code; any line you don't test, I guarantee you'll end up with a bug, you won't know you have a bug, and there's no way to ever find out. Did you have something, Rachel? So we have several questions. One is a note that the p_i times c_i is very similar to what happens in the memory networks paper — in that case the output embeddings are weighted by the attention probability. Yeah, or it's a lot like a regular attentional language model. Can you talk more about the idea you mentioned of first having a convolutional layer and passing that into an RNN — what do you mean by that? So here is a fantastic paper — I've still got it up, let me just find it... here it is. We looked at sub-word encodings last week for language models, and I don't know if any of you thought about this and wondered: what if we just had individual characters? There's a really fascinating paper called "Fully Character-Level Neural Machine Translation without Explicit Segmentation", from November of last year, and they actually get fantastic results at the pure character level, beating pretty much everything, including the BPE approach we saw last time. I'm just trying to find the results — here we are. They looked at lots of different approaches, comparing BPE to individual characters, and most of the time the character-level model got the best results. Their model looks like this: they start out with every individual character, which goes through a character embedding — just like the character embeddings we've used lots of times — and then you take those character embeddings and pass them through a one-dimensional convolution. I don't know if you remember, but in part one of this course Ben had a blog post showing how you can do convolutions of multiple sizes and concatenate them all together, so you could use that approach, or just pick a single size. You end up scrolling your convolutional window across the characters, so you get the same number of convolution outputs as you had letters, but each now represents the information in a window around that letter. In this case they then did max pooling — they had different sizes, a size three, a size four and a size five — basically asking which window had the highest activations around each position, and then they took those max-pooled outputs and put them through a second set of segment embeddings.
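The front end of that character-level model — character embedding, a few widths of 1D convolution, then max pooling to shorten the sequence before the RNN — might look roughly like this in Keras. The sizes are illustrative, not the paper's exact ones, and this sketch skips the highway layers they use before the RNN:

```python
from keras.layers import Input, Embedding, Conv1D, MaxPooling1D, concatenate, GRU
from keras.models import Model

vocab_size, seq_len, emb_dim = 300, 450, 128

chars_in = Input(shape=(seq_len,), dtype='int64')
emb = Embedding(vocab_size, emb_dim)(chars_in)            # (batch, seq_len, emb_dim)

# Several convolution widths over the character embeddings, concatenated
convs = [Conv1D(64, w, padding='same', activation='relu')(emb) for w in (3, 4, 5)]
feats = concatenate(convs)                                # (batch, seq_len, 192)

# Max pooling shortens the sequence so the RNN sees far fewer time steps
pooled = MaxPooling1D(pool_size=5)(feats)                 # (batch, seq_len // 5, 192)

# The paper inserts highway layers here; this sketch goes straight to a GRU
encoded = GRU(256)(pooled)

model = Model(chars_in, encoded)
```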
They then put that through a highway network — the details don't matter too much; it's something a bit like the DenseNet we learned about last week, though a slightly older approach — and then finally, after doing all that, stick it through an RNN. The idea in this model is that they do as much learnt pre-processing as possible and only then put it into an RNN, and because of those max pooling layers the RNN ends up with far fewer time steps, which is really important for minimizing the amount of processing in the RNN. I'm not going to go into detail on this, but check out the paper, because it's really interesting. The next question: for the destinations, wouldn't we have more error for the peripheral points — are we taking a centroid of clusters? I don't understand that, sorry — do you understand it? I mean, all we're doing is taking the softmax p, multiplying it by the cluster centers c, and adding them up. I think the first part was asking whether destinations that are more peripheral would have higher error, because they'd be harder to predict this way. Yeah, probably — which is fine, because by definition they're not close to a cluster center, so they're not common. And going back, there was a question on the Rossmann example: what does "MAPE with neural network" mean — I would have expected that result to be the same, so why is it lower? That one is just using a one-hot encoding, without an embedding layer. OK — gosh, we've run out of time a bit quickly, but I really want to show you this. Quite a few of the students and I have been trying to get a new approach to segmentation working, and I finally got it working in the last day or two, and I really wanted to show it to you. We talked last week about DenseNet, and I mentioned that DenseNet is ass-kickingly good at image classification with a small number of data points — crazily good — but I also mentioned that it's the basis of this thing called the One Hundred Layers Tiramisu, which is an approach to segmentation. Segmentation refers to taking an image and figuring out where's the tree, where's the dog, where's the bicycle and so forth. It seems like we're Yoshua Bengio fans today, because this is one of his group's papers as well. Let me set the scene. Brendan, one of our students — many of you have seen a lot of his blog posts — has successfully got a PyTorch version of this working, so I've shared that on files.fast.ai, and I got the Keras version working, so I'll show you the Keras version because I actually understand it; if anybody's interested in asking questions about the PyTorch version, hopefully Brendan will be happy to answer them during the week. The data looks like this: there's an image, and then there are the labels, so you can see here you've got traffic lights, poles, trees, buildings, footpaths, roads. Interestingly, the dataset we're using is called CamVid, and it's actually frames from a video, so a lot of the frames look very similar to each other and there are only about 600 frames in total — so there's very, very little data in this CamVid dataset. Furthermore, we're not going to do any pre-training, so we're going to try to build a state-of-the-art segmentation system on video — which already has much lower information content, because most of the frames are pretty similar — using just 600 frames and without pre-training.
If you had asked me a month ago, I would have told you it's not possible — it just seems like an incredibly difficult thing to do — but just watch. I'm going to skip to the answer first. Here's an example of a particular frame we're trying to match, and here is the ground truth for that frame: you can see there's a tiny car here and a little car here — there are those little cars, and there's a tree. Trees are really difficult; they're incredibly fine, fiddly things. And here is my trained model, and as you can see it's done really, really well. It's interesting to look at the mistakes it made: this little thing here is a person, but you can see their head looks a lot like a traffic light and their jacket looks a lot like a mailbox, whereas these tiny little people here it's done nearly perfectly; this person it got a little bit confused about. Another place it's gone wrong is this, which should be road — or rather, it wasn't quite sure what was road and what was footpath, which makes sense, because the colors do look very similar. Had we pre-trained, though, a pre-trained network would have understood that crosswalks tend to go straight across — they don't tend to look like that — so you can see where these minor mistakes come from. It also would have learned, had it seen more than a couple of hundred examples of people, that people generally have a particular shape. There's just not enough data for it to have learned some of these things, but it is nonetheless extraordinarily effective. Look at this traffic light, which is partly surrounded by a sign: the ground truth actually has the traffic light and then a tiny little edge of sign, and it's even got that right. It's an incredibly accurate model. So how does it work — and in particular, how does it do these amazing trees? The answer is in this picture, and it's inspired by a model called U-Net. Until U-Net came along, everybody was doing these kinds of segmentation models with an approach just like what we did for style transfer: a number of convolutional layers with max pooling, or with a stride of two, which gradually make the image smaller and smaller (and the receptive field bigger), and then you go back up the other side using upsampling or deconvolutions until you're back at the original size, so your final layer is the same size as your starting layer and has one channel per class going into the softmax. The problem with that — in fact, I'll show you an example. There's a really nice model called ENet; it's not only an accurate model for segmentation, it's also incredibly fast — it can actually run in real time, on video. But look at the mistakes it makes: this chair has a big gap here, and here, and here, and ENet gets it totally wrong. The reason why is that it uses the very traditional downsampling–upsampling approach, and by the time it gets to the bottom it has just lost track of the fine detail. So the trick is these connections here. What we do is start with our input, do a standard initial convolution just like we do with style transfer, and then a DenseNet block, which we learned about last week.
Then we keep going down: dense block, then a max-pooling-type thing; dense block, max-pooling-type thing; and as we go up the other side we do a deconvolution, dense block, deconvolution, dense block — and we take the output from the dense block on the way down, copy it over to the corresponding point on the way up, and concatenate the two together. Brendan actually drew this on our whiteboard a few days ago when we were explaining it to Melissa, so he's shown every stage here. We start out with a 224 by 224 input; it goes through the initial convolutions with 48 filters, then through our dense block, which adds another 80 filters; it then goes through what they call a transition down — basically a max pooling — so it's now size 112. We keep doing that: dense block, transition down, so it's 56 by 56, then 28 by 28, 14 by 14, 7 by 7. Then on the way up we do a transition up, so it's 14 by 14 again, and we copy across the 14 by 14 results from the downward path and concatenate them together; then a dense block, and a transition up to 28 by 28, so we copy across our 28 by 28 from the downward path, and so forth. By the time we get all the way back up, we're copying across something that was originally 224 by 224 and hadn't had much done to it — it had only been through one convolutional layer and one dense block, so it hadn't had much rich computation applied. But the thing is, by the time it gets all the way back up here, the model already pretty much knows this is a tree, this is a person, this is a house; it just needs the fine little details — where exactly does this leaf finish, where exactly does the person's hat finish. So it's copying across something that's very high resolution but doesn't carry much rich information, and that's fine, because it really only needs to fill in the details. These things are called skip connections; they were really inspired by the U-Net paper, which has won many Kaggle competitions, but here we're using dense blocks rather than standard convolutional blocks. So let me show you — we're not going to have time to go into this in detail, but I've done all this code in Keras from scratch, and it's actually a fantastic fit for Keras: I didn't have to create any custom layers, I didn't really have to do anything weird at all — except for one thing, data augmentation. The data augmentation is: we start with 480 by 360 images, we randomly crop out some 224 by 224 part, and we also may randomly flip it horizontally. That's all perfectly standard, but Keras doesn't really have random crops, unfortunately, and more importantly, whatever we do to the input image we also have to do to the target image — we need to take the same 224 by 224 crop and do the same horizontal flip. So I had to write a data generator, which you may actually find useful anyway. I called it a segment generator, and it's just a standard generator, so it's basically a next function: each time you call next, it grabs a random bunch of indices, goes through each of them and grabs the necessary item — taking a random slice and sometimes randomly flipping it horizontally — and it does this to both the x's and the y's, returning them together.
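The core of that generator is just: pick one random crop and one random flip, and apply the same thing to the image and to its label mask. Something like this sketch, assuming 360 by 480 inputs and 224 crops as in the lesson:

```python
import numpy as np

def segm_generator(images, labels, batch_size=3, out_sz=(224, 224), train=True):
    """images: (n, 360, 480, 3) floats; labels: (n, 360, 480) integer class ids."""
    n, h, w = images.shape[:3]
    ch, cw = out_sz
    while True:
        idxs = np.random.randint(0, n, batch_size)
        xs, ys = [], []
        for i in idxs:
            # Same random crop for input and target (center crop at validation time)
            r = np.random.randint(0, h - ch + 1) if train else (h - ch) // 2
            c = np.random.randint(0, w - cw + 1) if train else (w - cw) // 2
            x = images[i, r:r + ch, c:c + cw]
            y = labels[i, r:r + ch, c:c + cw]
            # Same random horizontal flip for both
            if train and np.random.rand() > 0.5:
                x, y = x[:, ::-1], y[:, ::-1]
            xs.append(x); ys.append(y)
        yield np.stack(xs), np.stack(ys)
```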
To randomly grab a batch of random indices each time, I created a little class called BatchIndices, which basically does just that, and it can have either shuffle=True or shuffle=False. This pair of classes might be really helpful for creating your own data generators — the BatchIndices class in particular. Now that I've written it you can see how it works: if I create BatchIndices over a dataset of size 10, grabbing three indices at a time, and then grab five batches, then with the default shuffle=False it just returns 0,1,2 then 3,4,5 then 6,7,8 then 9, and when it's finished it goes back to the start with 0,1,2 again. If I say shuffle=True, on the other hand, it returns them in random order but still makes sure it covers all of them, and when we're done it starts a new random order. So this makes it really easy to create random generators, and that was the only thing I had to add to Keras to get this all to work. Other than that, we wrote the Tiramisu, and the Tiramisu looks very, very similar to the DenseNet we saw last week. We've got all our pieces: the relu, the dropout, the batch norm, the relu on top of batch norm, the concat layer — this is something I had to add — my conv 2D followed by dropout, and then finally my batch norm followed by relu followed by conv 2D. And this is just the dense block we saw last week: a dense block is something where we keep grabbing 12 or 16 filters at a time, concatenating them onto the last set, and doing that a few times. Here's something interesting: the original paper, for its downsampling (they call it the transition down), did a one-by-one convolution followed by max pooling. I discovered that a stride-two convolution gives better results, so you'll see I have not followed the paper — the version that's commented out here is what the paper did, but this works better, which was interesting. Interestingly though, on the transition-up side — remember that checkerboard-artifacts blog post we saw, which showed that upsampling followed by a convolutional layer works better than a deconvolution? It does not work better here; a deconvolution works better for this, which is why you can see this deconvolution layer. I thought that was interesting. So basically, as I go downsampling a bunch of times, I do a dense block and then keep track of my skip connections — I keep a list of them, appending one after every dense block — and then I pass them to my upward path: do my transition up and concatenate it with the corresponding skip connection. That's the basic approach, and with those pieces the actual Tiramisu model itself is less than a screen of code: do a three-by-three conv, do my down path, do my up path using those skip connections, then a one-by-one conv at the end and a softmax.
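Before we get to the results, here's roughly what those building blocks look like in Keras — a sketch rather than the exact notebook code, with illustrative growth rates and dropout:

```python
import keras.backend as K
from keras.layers import (Activation, BatchNormalization, Conv2D,
                          Conv2DTranspose, Dropout, concatenate)

def bn_relu_conv(x, n_filters, ksize=3, stride=1, drop=0.2):
    # batch norm -> relu -> conv (-> dropout): the basic unit inside each block
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Conv2D(n_filters, ksize, strides=stride, padding='same')(x)
    return Dropout(drop)(x) if drop else x

def dense_block(x, n_layers=4, growth=16, drop=0.2):
    # keep grabbing `growth` new filters and concatenating them onto everything so far
    for _ in range(n_layers):
        y = bn_relu_conv(x, growth, drop=drop)
        x = concatenate([x, y])
    return x

def transition_down(x, drop=0.2):
    # the paper uses a 1x1 conv then max pooling; this is the stride-2 conv variant
    # mentioned above, which seemed to work a little better
    return bn_relu_conv(x, K.int_shape(x)[-1], ksize=1, stride=2, drop=drop)

def transition_up(x, n_filters):
    # deconvolution (rather than upsampling + conv) on the way back up
    return Conv2DTranspose(n_filters, 3, strides=2, padding='same')(x)
```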
These dense nets — and this fully convolutional dense net, the Tiramisu — take quite a long time to train. They don't have very many parameters, which I think is why they work so well with these tiny datasets, but each epoch took a couple of minutes, and in the end I had to run many hundreds of epochs. I was also doing a bunch of learning rate annealing, so this really had to train overnight, even though I only had about five or six hundred frames. But in the end — train, train, train — I got a really good result. Now, I was a bit nervous at first, because I was getting 87.6 percent accuracy while the paper was getting 90-plus. It turns out that three percent of the pixels are marked as void — I don't know why, but in the paper they actually remove them — so you'll see in my results section there's a bit where I remove those void pixels, and I ended up with 89.5 percent. None of us in class managed to replicate the paper, which got 91.5 (or 91.2) percent: we tried the Lasagne code they provided, we tried Brendan's PyTorch, we tried my Keras. But even though we couldn't replicate their result, this is still better than any other result I found, so it's still super accurate. A couple of quick notes about this. First, they also tried training on something called the Gatech dataset, which is another video dataset, and the degree to which this is an amazing model is really clear there: this 76 percent is from a model which was built specifically for video — it actually includes the time component, which is absolutely critical, and it uses a pre-trained network, so it has seen something like a million images in pre-training — and it's still not as good as this model. That is an extraordinary comparison. Then there's the CamVid comparison; here's the model we were just looking at. I actually looked into this and thought: 91.5 versus this one here at 88 — wow, it actually looks like it's not that much better, and I was really surprised — even on tree, where I really thought it should win easily, it doesn't win by much. So I went back and looked at the paper they're comparing to, and it turns out those authors actually trained on crops of 852 by 852 — a way higher resolution image to start with. So you've got to be really careful when you read these comparisons; sometimes people actually shoot themselves in the foot, and these guys were comparing their result against a model that was using a much bigger picture, so again, this is actually way better than they made it look. Another one: this 88 here also looks impressive, but then I looked across and noticed that the Dilation8 model is way better than that model on every single category — way better — and yet somehow the average is only 0.3 better, and I realized that actually has to be an error. So this model is a lot better than the table gives the impression. I briefly mentioned that there's a model without any skip connections called ENet, which is actually better than the Tiramisu on everything except tree — but on tree it's terrible, 77.8 versus... oh, hang on, 37.3, that's not right. OK, I take that back — I'm sure it was less good than this model, but now I can't find that data. Anyway, the reason I wanted to mention it is that Eugenio is about to release a new model which combines these approaches with skip connections.
It's called LinkNet, so keep an eye on the forum, because I'll be looking into it quite shortly. OK — have you got something, Rachel? We have a few short questions, although I know we're almost out of time, so I don't know if you want to... Yeah, let's answer them on the forum, because I actually wanted to talk about this briefly. A lot of you have come up to me and said: we're finishing — what do we do now? And the answer is: we have now created a community of people who have spent well over a hundred hours working on deep learning over many, many months, who have built their own boxes, written blog posts, set up social-impact talks, written articles and forum posts — all kinds of stuff. This community is happening, so it doesn't make any sense, in my opinion, for Rachel and me to now be the ones saying "here's what happens next." Just like Elena decided she wanted a book club, talked to Mindy, and we now have a book club — in a couple of months' time you can all come to the book club. So, what's next? Well, the forums will continue forever; we all know each other; let's do good shit. Most importantly, write code. Please write code: build apps, take your work projects and try doing them with deep learning, build libraries to make things easier. Maybe go back to stuff from part one of the course and think: why didn't we do it this other way — maybe I can make this simpler. Write papers: I showed you that amazing result of the new style transfer approach from Vincent last week, and hopefully that might turn into a paper. Write blog posts. In a few weeks' time all the MOOC folks are going to be coming through and doing part two of the course, so help them out on the forum — teaching is the best way to learn yourself. And I really want to hear the success stories, because people don't believe that what you've done is possible. I know that because, as recently as yesterday, the highest-ranked Hacker News comment on a story about deep learning was about how it's pointless trying to do deep learning unless you have years of mathematical background and C++ and you're an expert in machine learning techniques across the board — otherwise there's no way you'll do anything useful on a real-world project. That is what everybody believes today, and we now know it's not true. So Rachel and I are going to start a podcast, where we're going to try to help deep learning learners, and one of the key things we want to do is tell your stories. If you've done something interesting at work, or you've got an interesting new result, or you're just in the middle of a project that's kind of fun — please tell us, on the forum or by private message or whatever, because we really want to share your story. And if it's not a story yet, tell us enough that we can help you, and that the community can help you. Get together: there's the book club; if you're watching this on the MOOC, organize other people in your geography to get together and meet up, or do it at your workplace.
In this group here, I know we've got people from Apple and Uber and Airbnb who started doing this in lunchtime MOOC chats and are now here at this course. Yes — Rachel and I also wanted to recommend that it would be great to start meetups to help lead other people through, say, part one of the course, and assist them in going through it. Yeah, absolutely. Rachel and I really just want to spend the next six to twelve months focused on supporting your projects. I'm very interested in working on this lung cancer stuff, but I'm also interested in every project that you're working on, and I want to help with those. I also want to help people who want to teach this: Yannette is going to go from student to teacher, hopefully soon, and will be teaching USF students about deep learning, and hopefully the next batch of people too. Anybody who's interested in teaching, let us know, because the best high-leverage activity is to teach the teachers. So I don't know where this is going to end up, but my hope is — well, basically I would say the experiment has worked. You're all here; you're reading papers, you're writing code, you're understanding the most cutting-edge, research-level deep learning that exists today; we've gone beyond the cutting-edge research in many situations, and some of you have gone beyond it yourselves. So let's build from here as a community, and anything that Rachel and I can do to help, please tell us, because we just want you to be successful — we want the community to be successful. One last question: will you still be active on the forums? Very active — my job is to make you successful. OK, so thank you all so much for coming, and congratulations to all of you.