I am recording now, so please keep talking. Don't be shy, Sarada, it's okay.

It's getting a bit noisy out here with street cleaning or something.

Oh good, I didn't hear anything. I say street cleaning; it's more like footpath cleaning. We have a walking path along the front of our house. Oh, come on — I press the recording button and everybody stops talking. Well, you know, I don't want to just hear my own voice all the time on these recordings, guys. There are things I wanted to cover in today's session, but the responsible part of me says I probably ought to create a lesson before Tuesday's class, so maybe we'll do that.

I've got a question, Jeremy. I had to leave before you finished that code change yesterday. Do you want to recap where we got to with weighting?

Probably not, because you can just watch the video; otherwise I guess we're just doing it twice.

So is it working now?

Yeah, it's all good — the concept is working correctly in terms of the code. We didn't get a better score, but I didn't particularly expect to either. Maybe after next Tuesday's lesson we'll revisit it, because I actually think the main thing it might be useful for is what's called curriculum learning, which is basically focusing on the hard bits. It looks like Nick's internet still isn't working, but Nick was saying the other day that he looked at which ones we're having the errors on — which is what we look at in the book, using the classification interpretation and plot_top_losses and so forth — and he said all the ones we're getting wrong are basically from the same one or two classes. I haven't done much with curriculum learning in practice. All it means in theory is that we use our weighted dataloader to weight the ones we're getting wrong higher. Whether that will actually give us a better result, I'm not sure, but I think that's more likely to be a useful path than simply reweighting things to be more balanced. We don't want things to be more balanced, because the classes we observe most often in the test set are actually the ones we want to be best at. I will say I didn't check whether the distribution of the test set is the same as the training set: if it's randomly selected then it will be, and if it's not, that would be a reason to use a weighted dataloader as well.

Okay, so what's the difference — is curriculum learning conceptually related to boosting?

Not really. Maybe. Boosting is where you calculate the difference between the actuals and your predictions to get residuals, then you create a model that tries to predict the residuals, and then you can add those two predictions together — which, if not done carefully, is a recipe for overfitting, but if done carefully can be very effective. Whereas here we're talking about something conceptually very different, which is saying: we're really bad at recognizing this category, so let's show that category more often during training.

Did I miss a question?

I guess my thought was really about focusing more on the examples you're getting wrong, which is conceptually doing something similar.
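For the curriculum-learning idea above — sampling harder classes more often — here is a minimal PyTorch-style sketch. The class error rates, labels, and tensor shapes are all hypothetical stand-ins; the session itself used fastai's weighted dataloader, which isn't shown in the transcript:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical per-class error rates, e.g. measured on a validation set.
# The idea: oversample the classes the model currently gets wrong.
class_error = {0: 0.02, 1: 0.03, 2: 0.30, 3: 0.25}   # classes 2 and 3 are the "hard" ones

# Toy stand-in data: 8 examples with integer class labels
xs = torch.randn(8, 10)
ys = torch.tensor([0, 0, 1, 2, 2, 3, 1, 0])
weights = [class_error[int(y)] for y in ys]          # higher weight = drawn more often

sampler = WeightedRandomSampler(weights, num_samples=len(ys), replacement=True)
dl = DataLoader(TensorDataset(xs, ys), batch_size=4, sampler=sampler)  # sampler replaces shuffle=True
for xb, yb in dl:
    print(yb)   # the hard classes 2 and 3 show up far more often than their base rate
```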
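And for contrast, the boosting recipe Jeremy describes — fit a model, fit a second model to the residuals, add the two predictions — as a tiny sklearn sketch on toy data (nothing here is from the session):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

m1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
resid = y - m1.predict(X)                              # residuals: what model 1 got wrong
m2 = DecisionTreeRegressor(max_depth=2).fit(X, resid)  # model 2 predicts the residuals

pred = m1.predict(X) + m2.predict(X)                   # boosted prediction = sum of the two
print(((y - m1.predict(X))**2).mean(), ((y - pred)**2).mean())  # squared error drops
```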
I was just going to ask: are the labels ever wrong, by accident or intentionally, in Kaggle?

Of course, absolutely.

Both — intentionally as well?

No, not intentionally. Not normally. Sometimes there might be a competition where they say: this is a synthetically generated dataset and some of the data is wrong, because we're trying to simulate something that happens in practice but we can't share the real data.

So is there any advantage in trying uncertainty values from something like MC dropout, to find a threshold for things that are too difficult and hence potentially wrongly labeled?

I'm not sure you would need that. The thing we use in the book and the course is simply to find the things that we were confident of but that turned out to be wrong, and then just look at the pictures.

So you think the raw softmax value is enough to know whether or not...?

I do, yeah; that seems to work pretty well. The only thing is you would need to be able to recognize these things in photos, but I'm sure if you spent an hour reading on the internet about what these different diseases are and how they look, you would pick it up soon enough — just like we did in chapter two for recognizing grizzly, black, and teddy bears.

Okay, but plausibly even just knocking out some of the extremely difficult examples might get you higher on the leaderboard, purely by virtue of them misleading the model?

Not by knocking out the hard ones, but by knocking out the wrong ones, yes — unless the test set is mislabeled consistently with the training set, in which case you would not want to knock them out, because you would want to correctly predict the things which people are incorrectly labeling as the wrong disease.

Something to try, though.

Yeah, I would do exactly what we did in chapter two; you can use exactly the same widget. As I say, you'd probably have to spend an hour learning about rice disease first, which would probably be a reasonably interesting thing to do anyway.

I just saw a link — there's a discussion in the paddy competition where some people have identified mislabeling, at least 20 images already.

Oh, okay, so it definitely happened. It says: "we have manually annotated every image with the help of agricultural experts, but there could be errors." Wow, this person knows more about rice than I do: "I think the images in the tungro class have lots of issues; these symptoms can easily be confused with potassium deficiency." Fair enough.

Is that an example of what you're talking about, where if a layman — sorry, a semi-expert — gets confused, then the labeling in the test set is probably the same?

Yeah, exactly. Fixing these would probably screw up your model, assuming the test set was labeled by the same people in the same way. Sometimes the test set is more of a gold standard: they'll make more effort, talk to a larger number of higher-quality experts and have them vote or something. Honestly, this competition doesn't even have any prize money attached, so I think it's probably really low investment, and I doubt they did that — but it can happen. It does make sense to invest in getting really good labels for the test set.
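For the chapter-two approach Jeremy refers to — surfacing confident-but-wrong predictions and then eyeballing them — a minimal fastai sketch. It assumes you already have a trained `learn` object, and the cleaner widget needs to run in a Jupyter notebook:

```python
from fastai.vision.all import *
from fastai.vision.widgets import ImageClassifierCleaner

# Interpretation ranks validation examples by loss:
# confident-but-wrong predictions come out on top
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()   # which classes get mixed up with which
interp.plot_top_losses(9)        # the images the model was most wrong about

# The widget from chapter 2: flick through top-loss images and mark
# ones to relabel or delete before retraining
cleaner = ImageClassifierCleaner(learn)
cleaner
```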
I was looking at one of the other competitions, the UNIFESP x-rays, and I think somebody there had identified that a wrist was wrongly labeled as a...

Is this a current one, Brian?

Yeah — there's no money again, but it's been running for a little while.

What's it called?

UNIFESP.

Oh right, it's another community competition. Gosh, it's not very popular — why are there only 74 teams?

I don't know; I was just looking around and it looked interesting, so I'm number 15 at the moment. It is a slightly weird one — well, it's interesting, because some of the x-rays have multiple labels, but the labels are just concatenated. So there's interesting discussion on how you then analyze that: would you treat a combination as a distinct classification, like a neck-and-chest class, or do you look at each label individually and then try to assemble a multiple label from the separate predictions?

Okay, I'm just having a look at this competition. When does it close?

There's a month to go, but I don't know exactly when that is. Normally it's like the 31st.

Okay, where do you see that?

When you go to the bottom of the overview, there's a whole timeline, and you just hover over it.

Oh my god, I see — it says "closes in a month", but you actually have to get a tooltip by hovering. Okay, thanks Tanishk. That's strange UX. So we've actually got more than a month, so maybe next week we could have a look at this one, because it would be a good opportunity to play around with medical image stuff — they're using DICOM, I think.

Yeah. Somebody has also supplied a library of PNGs, which made it easier to use, but I don't know what you lose by using that rather than the DICOM images.

Well, it rather depends. DICOM is a very generic file format that can contain lots of different things; one of the things DICOM can contain is higher bit-depth images than a PNG allows, so yes, they might have gotten rid of that. fastai has a nice medical imaging library — it's pretty small, but it has some useful stuff; I think it's fastai.medical.imaging — which can handle DICOM directly. And I see there's a fastai entry in the competition as well. Oh, that'd be fun; we should try this next week. I see, and there are the PNGs.

Yeah, I think the DICOMs come to about 27 gigabytes.

Oh my god. Okay, so the PNGs were quite attractive from that point of view. One thing you can do with DICOM is compress it, particularly using JPEG 2000, which is really good compression, but people often don't, for some reason. So probably the first thing I'd look at in that competition is the DICOM, to see whether it's storing 16-bit data or not — and if it is, I would try to find a way to resave it without losing the extra information, which I think we've got examples of in our medical imaging tutorial.

All right, I'll take a look at that.
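A hedged sketch of that first check — reading a DICOM and seeing whether it stores more than 8 bits per pixel — using pydicom directly (fastai.medical.imaging wraps the same machinery). The file paths are hypothetical, and the int32 conversion is the classic Pillow workaround for writing 16-bit grayscale PNGs:

```python
import numpy as np
import pydicom
from PIL import Image

ds = pydicom.dcmread("train_dicom/example.dcm")   # hypothetical path
print(ds.BitsStored, ds.pixel_array.dtype)        # e.g. 12 or 16 bits stored, uint16 array

# If the data really uses more than 8 bits, an 8-bit PNG conversion
# throws information away. Saving via Pillow's "I" mode keeps the full
# range, since "I"-mode images are written as 16-bit grayscale PNGs.
arr = ds.pixel_array.astype(np.int32)
Image.fromarray(arr, mode="I").save("example_16bit.png")
```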
All right, I'm going to share my screen, even though I don't know what I'm doing.

I'm going to have to drop in a few minutes, but I'll catch the rest on the recording.

All right, thanks. Nice to see you.

By the way, I was looking at this ConvNeXt paper, and gosh, everybody congratulates transformers on everything. "Vision transformers bring new ideas like the AdamW optimizer" — but guess who actually wrote the first thing saying we should always use the AdamW optimizer? That would be Sylvain at fast.ai. I think that was years before vision transformers. There we go: mid-2018.

I read that paper last night, and I'm just thinking — they talk about how all of these things were already there, right? They just rediscovered them: slightly larger kernel size and things like that. Which raises the question: why has no one just done the experiments to tweak all these things together?

I mean, we do, but nobody takes any notice, because they're not written up in PDFs, you know. The thing is, a lot of researchers aren't good practitioners, so they're not very good at training accurate neural networks; they don't know these tricks, and they don't hang out on Kaggle and learn about what actually works. And then it's not always easy to publish: even if you did stick it into a PDF and submit it to NeurIPS, there's no particularly high likelihood that they'll accept it, because research-wise the field is very focused on theory results and, you know, things with lots of Greek letters in them.

Does that mean part of the problem is that the datasets — the benchmarks — are just too inaccessible to the average person?

No, I wouldn't say that for ImageNet-1k. The issue, I think, is that the culture of research is not particularly interested in experimental results.

With my limited experience, I will say it's very hard to find reviewers as well, especially if you have a very specific domain and you're not just running all the sample datasets you can find in open source. A lot of peer reviewers just won't pick it up to review — even when we paid for the review so people could read it for free, it took us three months just to find a reviewer.

Jeremy, on the topic of papers: how do you decide which papers are worth reading, given the situation?

I'm very fond of papers that describe things which did very well in an actual competition, because then we know this is something that actually predicts things accurately.
You can get similar confidence if they've got a good table of results, so generally speaking I like things that actually have good results — particularly if they show how long it took to train and how much data they trained on. Are they getting good results using less data and less time than you might expect for the same thing? I certainly wouldn't focus only on things that get good results on really big datasets; that's not necessarily more interesting. I'm very interested in things that show good results using transfer learning — I look for things that are practically useful, and I don't train that much from random weights, so I'm very interested in things that do well on transfer learning. I also look for people whose work you've liked before. In particular, that doesn't mean always reading the latest papers: if you come across a paper from somebody you find useful, go back to their Google Scholar, read their older papers, see who they collaborate with, and read those papers too. For example, I really like Quoc Le at Google Brain; he and his team do a lot of good work. It tends to be very practical, with high-quality results, and he seems to have similar interests to mine — he tends to do stuff involving transfer learning and getting good results in fewer epochs and things like that — so if I see he's got a new paper out, I'm pretty likely to read it.

I have a question — for these kinds of competitions, and in a lab type of environment: when do you stop iterating on a model? Someone asked me, when is enough enough for training on the data that you have?

Well, there's some reason you're doing this work, right? So hopefully you know when it does what you want it to do.

The thing that happens to me all the time is that we train the model and it works perfectly fine in the lab, and then as soon as we throw in a couple of images that are not part of the set, it goes nuts — because there's more light, or the temperature is different, or stuff like that.

So that's a different problem, right? That means you're not using the right data to train on. You need to be thinking about how you're going to deploy this thing when you train it, and if you train it with data that's different from how you're going to deploy it, it's not going to work. That's what that means. It might be difficult to get enough data of the kind you're going to deploy on, but at some point you're going to be deploying this thing, which means by definition you've got some way of getting the data you're going to deploy it with. So do the exact thing you're going to do to deploy it — but don't deploy it; just capture that data until you've got some actual data from the actual environment you want to deploy the model in. You can also take advantage of semi-supervised learning techniques, and transfer learning, to maximize the amount of juice you get from the data you've collected. And finally, let's say for medical imaging: okay, you want to deploy a model to a new hospital, and they've got a different brand of MRI machine you haven't seen before.
I would take advantage of fine-tuning. Each time I deployed it to some different environment where things are a bit different, I would expect to have to go through a fine-tuning process to train it to recognize that particular MRI machine's images. But each time you do that fine-tuning, it shouldn't take very much data or very much time, because your model has already learned the key features, and you're just asking it to learn slightly different ways of seeing those features. I don't think you'll solve this by training for longer; you'll solve it by figuring out your data pipeline, your data labeling, and your rollout strategy.

Usually the issue we're having is that we don't have enough data for certain categories. But the thing that you did yesterday resolves a little bit of that problem, I think; we're going to start using it.

Yeah — also, if you don't have enough data for some category, don't use the model for that category. So rather than using softmax, use a binary sigmoid as your last layer, and then you've got a probability that each thing appears in the image — and so you can recognize when none of the things that you can predict well appear in the image. You always want a human in the loop anyway, so when you don't find any of the categories you've got enough data to find, triage those to human review.

One thing that we did — we have 50-something categories. Oh, just one moment, hang on... yes, sorry about that. We had 50 categories, and some of them have a lot of items — like 10 of them have a lot — so we ended up doing a three-step kind of process: the ones with a lot, the ones with a medium number, and the ones with a smaller number. And it looks like that resolves the problem a little bit.

Cool.

This was to classify metadata coming from other systems, for legal purposes — legal retention.

I see, got it.
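One possible shape for that per-site fine-tuning loop, sketched with fastai. The export file name and the new-hospital folder are hypothetical stand-ins, not anything from the session:

```python
from fastai.vision.all import *

# Hypothetical: a model trained on the original sites, exported earlier
learn = load_learner("mri_model.pkl")

# Hypothetical: a small labeled sample captured from the new MRI machine
dls_new = ImageDataLoaders.from_folder(Path("new_hospital_sample"), valid_pct=0.2)
learn.dls = dls_new

# A little data and a couple of epochs is often enough, since the
# backbone has already learned the key features
learn.fine_tune(2, base_lr=1e-3)
```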
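And a sketch of the binary-sigmoid alternative to softmax that Jeremy suggests, in plain PyTorch — the feature size, threshold, and routing logic are illustrative assumptions to tune, not a prescribed recipe:

```python
import torch
import torch.nn as nn

n_classes = 50
head = nn.Linear(512, n_classes)        # final layer: one independent logit per class
loss_fn = nn.BCEWithLogitsLoss()        # sigmoid per class, rather than softmax across classes

# Training: targets are multi-hot vectors rather than single class ids
features = torch.randn(1, 512)          # stand-in for the body's output
targets = torch.zeros(1, n_classes); targets[0, 3] = 1.0
loss = loss_fn(head(features), targets)

# Inference: each class gets its own probability, so "none of the above"
# is detectable and can be routed to the human in the loop
probs = torch.sigmoid(head(features))
threshold = 0.5                         # hypothetical; tune on validation data
if (probs < threshold).all():
    print("no known category detected -> triage to human review")
else:
    print("predicted classes:", (probs >= threshold).nonzero().tolist())
```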
Jeremy, I had a question actually. You tried the weighted dataloader, right, and I think you submitted that to Kaggle in a notebook. Did you do any validation locally first before submitting to Kaggle — do you have a setup like that?

No. I mean, you saw what I did, and when I did it. I was intentionally using a very mechanistic approach, because it was part of showing: here are the basic steps of pretty much any computer vision model, which are entirely mechanical and don't require any domain expertise.

My question was more: should we always treat the public leaderboard as the validation, or should we create a held-out local dataset first to validate?

Yeah, so I always have a validation set, which we saw in this one — I just used a random splitter, because as far as I know the test set in this Kaggle competition is a randomly split validation set. So whether it's for Kaggle or anything else, creating a validation set that as closely as possible represents the data you expect to get in deployment, or in your test set, is really important. I actually didn't spend the time doing that on this paddy competition. Normally on Kaggle, if somebody notices there's a difference between the test set and the training set, it'll appear in the discussions or in a Kaggle kernel or something, which is partly why I didn't look into it. But you should probably check: does it have the same distribution of disease types, judging from the predictions you create? Do the images look similar? Do they have the same sizes? For me, as soon as I see any difference between the test set and the training set, that sets off my alarm bells, because now I know it's not randomly selected — and if you know it's not randomly selected, you immediately have to think: okay, they're trying to trick us. So I would then look everywhere I could for differences, because it takes effort to not randomly select a test set, so they must be doing it very intentionally, for some reason — quite often for wrong reasons, it turns out in practice. I don't think a Kaggle competition should ever silently give you a systematically different test set. There are great reasons to create a systematically different test set, but there's never a reason not to tell people. If it's medical imaging from a different hospital, you should say this is a different hospital; if it's fishing, you should say these are different boats. You want people to do well on your data, so if you tell them, they can use that information to give you better models.

Kurian, going back to what you asked — there's the validation in training, and there's whether your local validation maps to what's happening on the leaderboard, the score on the hidden test set. But there's one other scenario I encountered recently that might be interesting for someone. When you're working on a competition, sometimes you might mess something up in your code: your model is still doing something useful, but you're failing to output a correctly formatted submission file — not in the sense that the submission fails on Kaggle, but some predictions are not aligned where they should be, or they're attached to a different customer ID, or stuff like that. So once you have one relatively good submission file, you can store it locally and then run a check on the correlation between your new submission and the one you know is okay. The correlation should be upwards of 0.9, and then you know, okay, I didn't mess up anything in the technical aspect of outputting the predictions. It's not a very strong check, but I was pulling my hair out asking why my better model wasn't scoring better, and this was a sanity-check step. Maybe it helps someone at some point.

Thanks, cool. Thanks.
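A minimal version of that submission sanity check. The file and column names are hypothetical; for hard-label submissions, an agreement rate plays the same role as correlation:

```python
import pandas as pd

# Hypothetical files: a submission you know scored reasonably, and a new one
good = pd.read_csv("submission_known_good.csv")
new = pd.read_csv("submission_new.csv")

# Align rows on the id column before comparing anything,
# so a row-ordering bug can't masquerade as a model change
merged = good.merge(new, on="id", suffixes=("_good", "_new"))

# For probability outputs: correlation should be high (e.g. > 0.9)
print(merged["prob_good"].corr(merged["prob_new"]))

# For class-label outputs: agreement rate serves the same purpose
print((merged["label_good"] == merged["label_new"]).mean())
```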
All right, so let me share my screen. Let's find Zoom... Zoom... share screen... oh, that's not the right button. Ctrl-Shift-H, okay. Where did we get to in the last lesson? We finished random forests, right? Oh, that's right — and I haven't posted that video yet.

That's okay, we can check it on fastai live.

Okay, so we did small models — and did we get to the end of this? Okay, so we basically finished the second of our Kaggle things. So next week, let's see what's in part three. Gradient accumulation — I think that's worth covering. One thing that somebody pointed out on Kaggle is that I've actually been using gradient accumulation wrong. I was passing in 2 here, meaning "do two batches before you accumulate" — but what I'm actually meant to put here is the target batch size I want, so that should actually be 64. I feel a bit stupid. What I've been doing is actually not using gradient accumulation at all: it's been doing a batch and saying "that's it", because I was saying my effective batch size should be two. So this has actually not been working at all. That's interesting. Whoops — so I've been using a batch size of 32 and not accumulating. Okay, so that's one thing to note. When I get Kaggle GPU time again we'll need to rerun this — actually, it only took 4,000 seconds, so I guess we could just get it running right now, couldn't we? So that should be 64. "...defines how large the effective batch size you want is, accumulated over batches" — oh, we can just remove this sentence entirely. Oh no, that's right: we divide the batch size by accum, based on how small we need it to be for our GPU's RAM. And on Kaggle — I don't know why, but the Kaggle GPUs have less memory than my GPU. Okay, so now let's try running it.

So, Jeremy, you would increase that accum number until you no longer get a CUDA out-of-memory error?

Yeah, and you can pretty much guess it. Once you've found a batch size that fits — so the default batch size is 64 — it's like: okay, if 32 fits, then I just need to set accum to 2, because 64 divided by 2 is 32. And the key thing I do here is this report_gpu function. What I did at home was just change accum until it got under 16 gig, and as you can see, I'm just doing a single epoch on small images, so this ran in, I don't know, 15 seconds or something. Batch size 64 by default, yeah. So then I just went through checking the memory use of convnext_large with different image sizes, again just using one epoch, and that's how I figured out what I could set the accum to for it to work. All right, so that should be right; time to save and run. When you're running something like this, you click "save version" and you click "run", and you'll then see it down here, and it runs in the background — you don't have to leave this open, so you can go back to it later. So if I just copy that, you can close it, and if I go to my notebook on Kaggle, this shows version three, because version four hasn't finished running yet. If I click here, I can go to version four, and it says it's still running — it's been running for about a minute — and it shows me anything that you print out, including warnings. So that's what happens on Kaggle.
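A hedged sketch of the corrected usage. fastai's GradientAccumulation takes the target effective batch size in samples, not a number of batches; the data path and architecture name here are stand-ins modeled on the session's paddy notebook, not verbatim from it:

```python
from fastai.vision.all import *

accum = 2                        # divide the per-step memory use by this factor
path = Path("paddy")             # hypothetical data folder
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, bs=64 // accum,
                                   item_tfms=Resize(480))

# GradientAccumulation(n_acc): accumulate gradients until n_acc SAMPLES have
# been seen, then take an optimizer step. Passing accum (e.g. 2) here was the
# bug: with bs=32, the first batch already exceeds 2 samples, so it steps
# every batch and never accumulates. The target effective batch size, 64,
# is what belongs here.
learn = vision_learner(dls, "convnext_large_in22k", metrics=error_rate,
                       cbs=GradientAccumulation(64)).to_fp16()
learn.fine_tune(1)
```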
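The report_gpu helper itself isn't shown in the transcript, so this is one plausible implementation using only standard PyTorch calls:

```python
import gc
import torch

def report_gpu():
    # Show how much GPU memory each process is currently holding...
    print(torch.cuda.list_gpu_processes())
    # ...then drop dead Python references and return cached blocks to the
    # driver, so the next configuration is probed from a clean slate
    gc.collect()
    torch.cuda.empty_cache()

# Usage: after each one-epoch probe with a given accum / image size,
# call report_gpu() and check that usage stays under ~16 GB
```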
So if we also do the multi-objective loss function thing, that would be cool. So I thought, for our next lesson, broadly speaking — gosh, this is taking a long time — I kind of want to cover what the inputs to a model look like and what the outputs of a model look like. In terms of inputs, the key thing is embeddings; that's the key thing we haven't seen yet about what model inputs look like. For model outputs, I think we need to look at softmax and cross-entropy loss, and then our multi-target loss, which we could do first as kind of a segue. So maybe, in terms of ordering, the segue would be doing multi-target loss first, then we could talk about softmax and cross-entropy, which would then lead us to looking at the bear classifier — what if there's no bear? — so we can just use the binary sigmoid. And then for embeddings, I guess that's where we'd cover collaborative filtering, because that's a really nice version of embeddings. So I guess the question, for those who have done the course before: are there any other topics? Time permitting, it would be nice to look at what a convnet is, so that we've got the outputs, the inputs, and then the middle.

What about more NLP stuff? I've heard that Hugging Face is going to integrate with fastai — maybe look at how that works?

Well, it's not done yet, so we can't do that yet — but definitely in part two.

I've got a question; I don't know if it's helpful. There's a lot of emphasis on outputs and inputs, but in the middle — just understanding the outputs of a hidden layer, whether they're well behaved or not — how do you debug that? How do you know when to look at that?

Very helpful. Last time we did a part two, we did a very deep dive into that, and I think we should do that again in our part two. Most people won't have to debug that, because if you're using an off-the-shelf model with off-the-shelf initializations, that shouldn't happen — so it's probably more of an advanced debugging technique, I would say. But if you're interested in looking at it now, definitely check out our previous part two, because we did a very deep dive there and developed the colorful dimension plot, which is absolutely great for that. And yeah, collaborative filtering would lead us exactly into that, thank you.

Sorry, Sarada?

Could you finally talk about the importance of the ethical side — or at least point to the resources Rachel prepared before? Because it's so easy to build a model now, how people apply them is getting scarier.

Yes, yes. I mentioned the data ethics course in lesson one, but you're right, it would be nice to touch on something there.

The lecture by Rachel from part one before — that was a great lecture.

Yeah, that actually would be a great thing to talk about; that lecture is not at all out of date. So maybe touch on it in this lesson, and also link, for varying levels of interest: the two-hour version would be Rachel's talk in the 2020 lectures, and for deeper interest, the full ethics course. It's a great point, thank you.
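Since softmax and cross-entropy are on the lesson plan above, a tiny worked example of the pair, with toy numbers in PyTorch:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw model outputs for 3 classes
probs = logits.softmax(dim=1)               # exponentiate, then normalize to sum to 1
print(probs)                                # roughly [[0.786, 0.175, 0.039]]

target = torch.tensor([0])                  # the true class is class 0
manual = -probs[0, target].log()            # cross entropy: -log(prob of the true class)
builtin = F.cross_entropy(logits, target)   # same value, computed stably from the logits
print(manual.item(), builtin.item())        # both ~0.241
```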
So then, for pretty much all of these things, we have Excel spreadsheets. Let's have a look: collaborative filtering — oh, looks like I've already downloaded that, huh.

Generally, I would encourage you to continue teaching in Excel. Yesterday I was on a panel at a data science conference, and when I mentioned that I start with Excel, it actually inspired a lot of people to want to have a go at data science and learning it.

Oh, that's good feedback — because there are certainly some people who don't find it useful at all, and they tend to be quite loud about it, so it's nice to hear that.

What — did those people bother you, Jeremy? I thought you didn't let them get to you.

Oh, I only pretend that nobody gets to me.

I was going to back you up: that was really great to see. I've only seen it done once before, and that was a physicist in Belgium who explained radiative transfer modeling using Excel, but it was just so nice to see the clarity.

Great, okay, thank you. Let's see — I think these are actually from the 2019 course, the fastai dl1 course, so I'm just going to grab them all. One thing I don't think we're going to cover this year in part one, which we will cover in part two, is different optimizers, like momentum and Adam and so forth. But I think that's okay, because nowadays you just use the default AdamW and it works, so I think it's fine not to know much more than that. It's a bit of a technicality nowadays.

Yeah, like you mentioned, Jeremy, it used to be something we did in one of the first lessons — but that was when you kind of had to know it, because you were always fiddling around with momentum and so on. To me, the biggest thing when starting on something is how to get the data in; once I figure out how to read in the data, then I can think. So I'm really grateful there's such an emphasis in this edition of the course on reading the data — with similar data, that is something we also stay on the lookout for, just understanding the data reading better.

Great. I don't think we do this one anymore, because we have better versions in Jupyter with ipywidgets. We've got this fun convolutions example, which I think is still valuable. We've got softmax and cross-entropy examples, and we've got collaborative filtering — that's something interesting, I wonder what that is — and then we've also got word embeddings.

Embeddings are such a cool and important subject, and it's something we haven't discussed much in this course.

No — we haven't touched them at all. Great. All right, it feels like a lot to cover, but we will do our best. Okay, I think we're up to our hour, so thanks everybody — nice chat today — and I'll get to work on putting this together. Have a nice weekend.

Thank you so much. Just to remind everyone, there's a race and bias video today, I think at six o'clock Brisbane time, for anyone interested. And Thomas mentioned there's going to be another U.S. session as well that we can join.

Yes, I think the details are on the forum. Yeah, thanks — see you.