So you've got a new Paddy submission. Let's take a look at the Kaggle competition. By the way, it's really beautiful to see, over the last week or two, all these fast.ai people just pop up at the top of that leaderboard. It's so cool. fast.ai, fast.ai, fast.ai, fast.ai, fast.ai. Who's this person? Is this fast.ai? At least the top five.

Yeah, most of the top five or top ten are following you in these walkthroughs. You've all got the same score, though. Somebody's got to break away. Kurian's got some secret sauce there.

Well, I've got a few ideas I can show you today if you want to try to take it a bit further, which I bet you do. Anybody have any comments or questions? In the meantime: share screen, the right screen, and I'll move you guys onto the other screen. And now I can see. All right. So, Paddy leaderboard. There we are. Where's Radek? Not here. Sirata, I see.

One thing that would be nice is if it wasn't such a mess. I set this up in Paperspace and started it running, and then I went to bed because it was taking so long. And I have a fear that if my browser goes to sleep, it'll basically stop the session, even though the process in the work session is still running.

It shouldn't. What happens is it queues the output up for when your browser comes back. But the problem is, there's some limit to how much it'll queue. So although it'll have run, if you've hit that limit you won't see all the outputs, which is nearly as bad. There are a few things you can do. The most obvious would be to use nbdev to export the notebook to a script, and then run the script in tmux. Then you can close everything down, come back, reattach tmux, and there it is.

Okay, that'd be interesting.

Yeah, maybe we'll look at that sometime.
Well, does Paperspace Gradient let you SSH in with a suitable IP?

I'm not sure. If you've got your own GPU at home, or on AWS or GCP or whatever, then what I do is run xrdp on it, which is a remote desktop server. Then I can connect to it like so and run Firefox. So this is my server's screen, remote-desktopped in. So if I now go in and run something, which you may remember from last time, I can set it running, then close everything down, go to sleep, come back the next day, reconnect to that screen, and it's still been running. That's my preferred way to do it. As I say, I don't know if it's possible on Paperspace Gradient.

Gradient machines seem to have a limit of six hours, from what I've seen so far. If you subscribe to their Pro plan or whatever, you can bump it up or get rid of it altogether. It's this tab here, the machine tab; you can change the auto-shutdown. Okay, looks like a week is the maximum. Oh no, there's "no limit" there as well, when you're paying.

When you're paying, but I mean, it's like eight bucks a month. You may as well.

Yeah, I've got Pro, but that's not on offer when you pick a free machine. Right: free P5000, six hours.

So Jeremy, sorry to interrupt. In the Paperspace support channels, they say you can assign a public IP to a machine and then SSH to it. So you could SSH in and then tmux.

Is that a Gradient machine, though?

Good question. I'm not sure that it would be. No, it's not. They also have this thing called Core, which is more like AWS or Google servers, and that absolutely lets you do a static IP. I don't even know if you need a static IP necessarily; a dynamic IP would work just as well, and it's a bit cheaper.
The thing is, though, I reckon the Core machines are pretty expensive. Hmm, these are very basic GPUs, but 45 cents an hour is not bad. I guess they're not too terrible. If you want something bigger, I guess they're about the same price really, 56 cents, so I take that back. I guess the thing I found expensive was the CPU pricing for running it all the time.

So tell me about this RDP solution you showed. How does it work? What computer are you RDPing into?

My own GPU machine, but it could just as well be an AWS machine or a GCP machine. This is basically the same as VNC, if you've come across VNC before; RDP is the Microsoft version of that. I generally like it quite a lot better. And much to my surprise, the Mac RDP client is better than the Windows RDP client; it even shows you a little mini screenshot of the screen. So yeah, this is now nearly finished training; halfway through, whatever.

Is it tricky to set up, given you're running a Linux server?

No, not even slightly tricky. It's called xrdp, since it's RDP for X Windows. I hate installing this kind of thing, it drives me crazy, but this is it: you just `sudo apt install xrdp`, `sudo adduser xrdp ssl-cert`, `sudo systemctl restart xrdp`. And then you might also want to run `sudo systemctl enable xrdp`, which will cause it to start automatically when you boot the machine. Oh, and if you've got a firewall, you'll have to let it in; it's port 3389, basically this one line of code. I think I did have a firewall, so I ran this. And that was it. It just used the username and password I already had on the machine. Very surprisingly un-annoying. And then I think I just installed Microsoft Remote Desktop from the Mac App Store; on Windows, I think it comes with Windows. So that was easy.
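The xrdp steps just described, collected in one place. This is a sketch assuming a Debian/Ubuntu machine with `ufw` as the firewall; swap in your own firewall tool if yours differs:

```shell
# Install and start the xrdp remote desktop server (Debian/Ubuntu)
sudo apt install xrdp
sudo adduser xrdp ssl-cert        # let xrdp read the TLS certificate
sudo systemctl restart xrdp
sudo systemctl enable xrdp        # start automatically on boot

# If you run a firewall, open the standard RDP port
sudo ufw allow 3389
```

Then connect with an RDP client (e.g. Microsoft Remote Desktop) using the machine's IP and your normal Unix username and password.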
Yeah, nobody seems to talk about it much. People mainly talk about VNC, which is also fine, but I find it a bit slower and a bit more awkward. All right. One weird thing, I guess, is that my machine, and this is pretty common, isn't really set up to be a graphical workstation; I always use it from the console. So I don't really have much of a window manager here. I can do a little bit; I don't even know what window manager it's using. Often you'll find there's no window manager running at all, but a bit of Googling will show you how to apt install KDE or whatever.

Okay, since we're on this topic, could I ask a question? I think I brought it up a little before, but I can't launch a fast.ai machine, a machine that runs fastai and PyTorch. A PyTorch one works. What suggestions would you have?

That means your pre-run.sh file has got a problem. So maybe comment it out: open up a PyTorch machine, move pre-run.sh to pre-run.bak or something. Or just open it and look; it might be obvious what's wrong with it.

Yeah, I couldn't see anything.

Well, when you say it's not working, what happens?

It just says "error" when I try to start it up, and I've tried to reach out to Paperspace support a couple of times. Maybe it's too abstract a question. But I'll try that.

Oh, people are putting stuff in the text chat. Please try to say things in the voice chat if you can, because it's way nicer for me not to have to check multiple windows. I know that's not possible for everybody.

Sorry, Jeremy, but there is a way to SSH into a Gradient machine: you have to trigger the virtual machine to be built from the command line. So you have to initiate the job, and Paperspace have a GitHub repo for it.
And is there any reason to do it that way? That sounds complicated, way more effort than it's worth. Just run a Paperspace Core machine if you want that, I guess.

Yeah, exactly. You can do it, it's just: why would you?

So yeah, for Paperspace and the issue around the notebook closing, I would start running something, close the notebook, and then reopen it, just to see what happens. Let's try it here.

Hey Jeremy, I'm using iTerm2, because you can do `tmux -CC` and you'll get native windows for tmux instead of the little terminal ones.

Sounds interesting. Let me try that.

It's minus capital C, capital C. "Unknown option C." Put it before the `a`. Yeah, so it'll be `tmux -CC a`.

Okay. And what are the benefits of this approach?

They're native windows. You can click and drag them, move them around, pop them out, all that stuff.

You can click and drag tmux windows as well; you've just got to have mouse mode on for them to work.

The shortcuts, like Command-Shift-D, will split panes; you don't have to go into the Ctrl-B prefix stuff.

Hmm, my tmux shortcuts don't work anymore. How do I do that now?

It's a different escape, I think, or you can go back to the original window that launched it.

Yeah, I'm not convinced it's going to help my workflow, but for people who are more familiar with native window shortcuts than tmux shortcuts, that could be cool. Thanks for the tip. What's going on down here?
The trick to get mouse support working, so for example my scroll wheel works nicely in this normal tmux window, is to have a `.tmux.conf` file that contains `set-option -g mouse on`. And then you can also increase your history limit. That's how come I can scroll.

The thing I like about tmux is that it's very integrated with the normal way of doing things in Unix. For example, if I want to search through my previous session, I can just hit question mark to search up, and search for "Makefile", just like I would in Vim; or hit slash to search forwards. My terminal works the same way as Vim, which I really like. That way I don't have to know iTerm's shortcuts and some other set of shortcuts; it's just the general Unixy way of doing things. And of course they'll also all work on the Paperspace terminal.

So let's try this. If we start running this and close that for a few seconds, you can see here it says in my console "starting buffering". So it's remembering things that were sent to it. So if I click back here now... hmm, that didn't seem to work. That's interesting. Okay, so I don't think you can just close it and reopen it. Let's try something else: what if we fake a network disconnection by closing SSH? Okay, so now our connection's failed. I'll leave that window open, and then we reconnect. Okay, that worked. So that's somewhat of an answer. But I think if you leave it long enough, it says "I've stopped listening for events because there have been too many", and tells you there's a configuration option you can change to make the buffer bigger. That's probably a useful thing to know about. Let me just go and turn this alarm off.
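The `.tmux.conf` lines being described, as a minimal sketch; the history-limit value here is just an example number, not the one used in the session:

```shell
# ~/.tmux.conf
set-option -g mouse on             # scroll wheel and mouse clicks in tmux
set-option -g history-limit 50000  # keep more scrollback
```

Reload with `tmux source-file ~/.tmux.conf` or restart tmux for the changes to take effect.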
Hang on. Sorry about that. My daughter likes to be permanently entertained, so any gap in her homeschooling schedule, she wants to be amused. She doesn't like the fact that I'm doing this and Rachel's at CrossFit. Okay.

So we had a look the other day at progressive resizing, and this is where I got to. One interesting thing you can do with progressive resizing is go crazy, like go extra large. You start out with some teeny-tiny images, train for a while, and then combine that with gradient accumulation to go up to big images without having to train so long. I think this is a good trick, particularly for code competitions on Kaggle where you've got serious resource constraints, or just for doing more with less time. On Kaggle you would have needed an accumulation level of four rather than two to make this fit, because they've got 16GB cards; we've got a 24GB card.

So then something else we started talking about was weighted models. That's weird, what happened to my weighted model? Did I move it to course22? That's fine. So the question we had yesterday was about unbalanced datasets, and whether it would be a good idea to balance our dataset. Let's start with a nice small model to use as a base case, something we've done before: ConvNeXt. Okay, let's use this one. So actually there's no point copying the progressive notebook, I guess; let's copy the more-models one. Okay, rename: this is going to be the weighted one. It may as well do the resizing; I don't really need it on my machine, but since we'll be putting it on Kaggle, it may as well. Okay, so that's going to be our base case. So for weighting, we can do `df.label.value_counts()`. So there's our level of unbalancedness. It's not too bad: there are a lot of normals, a lot of blasts, and not many of these bacterial thingies. Nick, I don't know if you're around. I mean, I can see you are around.
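Gradient accumulation, which is what makes the big-image stage fit in memory, just means summing gradients over several micro-batches and stepping once. A plain-Python sketch of the idea (not fastai's `GradientAccumulation` callback):

```python
def sgd_with_accumulation(grads, accum=2, lr=0.1, w=0.0):
    """Average gradients over `accum` micro-batches, then take one SGD step."""
    buf, k = 0.0, 0
    for g in grads:
        buf += g
        k += 1
        if k == accum:              # step only every `accum` micro-batches
            w -= lr * buf / accum
            buf, k = 0.0, 0
    return w

# two micro-batches with gradient 1.0 behave like one big batch with gradient 1.0
stepped = sgd_with_accumulation([1.0, 1.0])
```

An accumulation level of four would mean `accum=4`: four micro-batches at a quarter of the batch size, with roughly the same effective update per step.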
I don't know if you're able to talk. But if you are, you could maybe tell us what you found, because I know you've been looking at which of these are hard to visually tell apart.

Yeah, for sure. Sorry, I dropped out earlier because we had a power cut here, but I'm back now.

Are you intentionally video-less?

Not intentionally; it's just broken at the moment. Sorry about that.

No worries.

But yeah, one thing I did, just to get a better handle on the dataset, was going through it and having a look at the different types. I found it really hard to pick what the difference even was between a normal image and, say, downy mildew or whatever. It could be quite hard to pick out. So one thing I thought would be fun was to almost segment or mask the images, playing with the color channels, to see if they'd come out a bit better. And when I did that, I was able to take the yellow dead bits, the diseased parts, and see them better when they were in bright red. And the thing is, when I've trained, I find there's a handful of images, like 20 to 25, that are very difficult to classify. And it tends to be these, actually, from the imbalanced classes, where it tends to categorize them as blast when they're not.

Yeah. In fact, let me just pull up one of my notebooks.

Do you want to maybe share your screen?

Yeah. Let me see. Are you able to see? It would help to make these bigger. Are you able to see the disease in these? Because I don't know what I'm looking for. How do we make this bigger? There's probably a figure size in matplotlib, isn't there? Is it `figsize`?

Yes. `figsize=`, I don't know which way round it is. We can't hear you, by the way, Nick.
I don't know if we lost you.

Jeremy, I also tried to look into the images, using the confusion matrix and then top losses to plot them out. But it's just too hard; it's beyond my domain.

I was planning to do that today, actually. So that's, yeah. I don't know what happened to Nick; maybe he's having internet problems again. I wonder if it's just red spots or something. So yeah, anyway, it's interesting that Nick said he found these ones difficult.

So there are basically two reasons to weight different rows differently. One is that some of them are harder, and you want them shown more often to give the computer more of a chance to learn them. The other is that some are less common, and the same thing applies. So one possible weighting for these would be to take their reciprocal. Then normal is going to be shown less often: if we weight all the normal ones by this amount and all the bacterial panicle blight ones by this amount, you're going to get more of the rare ones. So that's one approach we could use. I feel like that might be overkill, so I'd be inclined not to do it quite that much. Another approach would be to take the square root, one over the square root, kind of like that. Then these are going to be shown about twice as often as these. So maybe let's start with this as our set of weightings.

Jeremy, could I ask a question at this point? About the weighting: when you talk about weighting such that images are shown more or less often, in cases where it's very imbalanced, could that lead to some classes being overfitted, because the model learns about those particular images themselves?

Yeah, definitely.

And is there a way to deal with that? I've read about how to deal with imbalance, and I've seen some recommendations to apply the weights when calculating the losses, rather than resampling the input.
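The two weighting schemes just mentioned, sketched on a made-up label list (in the real notebook the counts come from `df.label.value_counts()`):

```python
from collections import Counter
import math

labels = ['normal'] * 10 + ['blast'] * 8 + ['bacterial panicle blight'] * 2
counts = Counter(labels)

# straight reciprocal: a class 5x rarer gets 5x the sampling weight
recip = {c: 1 / n for c, n in counts.items()}
# gentler: one over the square root, so it only gets sqrt(5)x the weight
soft = {c: 1 / math.sqrt(n) for c, n in counts.items()}
```

With counts 10 and 2, the reciprocal scheme shows the rare class five times as often relative to normal, while the square-root scheme shows it only about 2.2 times as often.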
So I just wondered whether that was possible.

I mean, they're different, right? In the end, you want it to be able to recognize the features of the images you care about, and there's no substitute for having it see the images enough times to recognize them. However, when it does that, because it sees the rare cases more often, it's going to think those rare cases are more probable than they actually are. So you have to reverse that when you make predictions. So that's something to be careful of. I think it would probably help to just try it and take a look at what that looks like.

So yeah, here are our weights. I'd be inclined to... can we merge them in directly? Let's take a look. So if I go `df.merge`, which is a way of doing a join in pandas, the right-hand side... yes, the right-hand side can be a Series. Cool. So merge on weights. What does that look like? Nope. Why not? Okay, `left_on`, I see. So `left_on='label'`, and the right side, I think, is matched on the index. I'm not a pandas expert; I don't know if anybody is. There we go. Okay. So that's added these weights here, given a slightly weird column name, but that's okay. So let's call that the weighted df.

And then we could take this little function and move it over here. And I think we want to use DataBlock at this point; it's often a good idea. We have a DataBlock version; we can certainly make one otherwise. Okay, here's a DataBlock. So let's get a DataBlock: an ImageBlock and a CategoryBlock, `get_y` is `parent_label`, item transforms are this.

Jeremy, I think you're in the wrong notebook. That should be the weighted one.

Thank you. Yes, I had these open here, but thank you. Okay. And batch transforms: we should just use the same ones we had here, to make it fair. So there's our DataBlock. Oh, we actually use this resizing.

Jeremy?

Yeah.
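What that merge ends up doing, on a tiny made-up frame (assumes pandas is installed; the `.rename('weight')` here sidesteps the "slightly weird name" the session got from the column-name collision):

```python
import pandas as pd

df = pd.DataFrame({'image_id': ['a.jpg', 'b.jpg', 'c.jpg'],
                   'label': ['normal', 'blast', 'normal']})

# one weight per class, indexed by the class label
wgts = (1 / df.label.value_counts() ** 0.5).rename('weight')

# merge is pandas' join: match df.label against the Series' index
wdf = df.merge(wgts, left_on='label', right_index=True)
```

Every row of `df` picks up the weight for its class; rows with the same label share a weight.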
Sorry. Sorry to interrupt. So this approach: we're going to use the DataBlock to even out the numbers of what's being sampled, so that we get more augmentations of the same images for the under-represented samples?

Kind of. It's nothing to do with the DataBlock as such. We're going to use things called weighted DataLoaders, and the weighted DataLoader is going to use these numbers here basically as probabilities of how likely it is to pick that row when it grabs a row for a batch.

Yep. I was going to say: add them all up and divide each by the sum, so they add to one.

The reason I needed a DataBlock is that the `weighted_dataloaders` method isn't something we get in the quick-and-dirty `ImageDataLoaders` thing, which doesn't have as much flexibility. So now that we've got a DataBlock, we can type the block dot... oh, and we'll have to import it. `import fastai.callback.`... what was it in again? I don't remember where fastai's weighted DataLoader lives. It's in `callback.data`. Oh, okay. So it's actually a method of `Datasets`. So we can get a `Datasets` object from a DataBlock, like so, and we pass in the source; that would be our list of image files. So `files = get_image_files(...)` on our training path. Pass those in, and there's our training set and there's our validation set. So those are `Datasets`: the things, remember, that we can index into to get a single x,y pair. And `weighted_dataloaders` is then something we can call on the `Datasets`, passing it weights and a batch size. Okay. And the weights are for the training set; we're going to have to be a bit careful about this. So we should be able to go `dss.weighted_dataloaders`. And in the source code: yes, it calls `weighted_dl`, which is here, with the weights and so on. All right. I'm not 100% sure how this is going to work, but let's try it. So, our weighted data frame.
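Under the hood, a weighted DataLoader draws each batch as a weighted random sample with replacement. A pure-Python sketch of that sampling behavior (an illustration of the idea, not fastai's implementation):

```python
import random
from collections import Counter

def weighted_batches(n_items, wgts, bs, n_batches, seed=42):
    """Draw batches of indices; each index is picked with probability proportional to its weight."""
    rng = random.Random(seed)
    return [rng.choices(range(n_items), weights=wgts, k=bs)
            for _ in range(n_batches)]

# item 2 is weighted 10x as heavily as items 0 and 1
batches = weighted_batches(3, [1, 1, 10], bs=64, n_batches=100)
counts = Counter(i for b in batches for i in b)
```

Because sampling is with replacement, heavily weighted items show up many times per "epoch" while lightly weighted ones may be skipped, which is exactly the not-quite-an-epoch behavior discussed later.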
So this is the weight for each row, right? And then we've got our files. Yeah, we need to be a bit careful here, because they're in different orders. So we need a way to get a list of weights where the two orders match each other.

Could you do it by key lookup?

Yeah, we could do it by key lookup. I'm actually thinking of something a little lazier, which is just to sort them both. Okay. So although this seems to only have... what's going on here? It doesn't have them all. Are they not contiguous? Sort values by image_id... no, they are contiguous. So where is image 1001?

The sorting must be by folder first, though.

Yes, of course. That's exactly what it is. Thank you. Okay. So we could use a key. That looks hopeful: it says here, if the key is a string, it uses `attrgetter`. So I think I can just pass in the key name. Ah, that is magic. That is the magic of fastcore right there. There we go. So that's sorting by name, and we can do the same thing for this one, like so. And so now they're sorted by the same thing. So that's a good step.

So the weights are basically `wdf.label_y`. Now, that's a pandas Series, which `.to_numpy()` would turn into an array. I'm just not quite sure whether this has to be just for the training set, or for both; we'll find out in a moment. If I run that... it doesn't like it. That's interesting. Ah, of course. The batch transforms didn't end up getting applied, because we used `.datasets`, which doesn't apply batch transforms. So we would need to apply them here. But that's quite confusing. So presumably, I don't see it here, but I would expect to be able to pass batch transforms at this point. This is all quite awkward, isn't it? So, `dl_kwargs=`... if we're creating a DataLoader, a weighted DataLoader... you know what would be a good idea? Probably to look at the `DataBlock.dataloaders` source code to see how it does it.
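fastcore's `sorted` accepting a plain string key is sugar for `operator.attrgetter`. The same folder-independent sort in plain Python, on hypothetical paths:

```python
from operator import attrgetter
from pathlib import Path

# paths sort folder-first by default, so 'blast/200.jpg' would come before 'normal/100.jpg'
files = [Path('blast/200.jpg'), Path('normal/100.jpg')]

# sorting on the .name attribute ignores the parent folder
by_name = sorted(files, key=attrgetter('name'))
```

Sorting both the file list and the data frame by the same filename key is what makes the two orders line up.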
Here we go: `after_batch` is what it's called. Okay, that's not it. Let's see. Okay, it's calling `.dataloaders`, passing in the keyword arguments. `.dataloaders` doesn't set `after_batch`. But `dataloaders`... well, that's on `Datasets`. So `Datasets.dataloaders` is this thing here, and that doesn't set `after_batch` either. And I think I know why. I think it's because, when we looked the other day at DataBlock, we noticed that it adds... oh yes, the ImageBlock adds `IntToFloatTensor` as a batch transform. So we might need to add that as well.

Okay, so it's getting PIL images. And the fact that it's getting PIL images means it's never being converted to a tensor. So, DataBlock... I really think there's something that calls `ToTensor` or similar at some point. Oh, there it is, in the item transforms. So why isn't that getting called? Because... oh, item transforms, I think, are also applied at the DataLoaders stage. Let's see. Yes, that's also done there. Okay, so basically, using `Datasets` instead of `dataloaders` is quite awkward. I think we need to fix this in fastai, because it's not being done for us. But what we could do, actually, is the same thing DataBlock does, which is to use its own item transforms and batch transforms. So if we have a look at our DataBlock... oops, that's a TransformBlock. Okay, I think this will all become clear in a moment, hopefully. It's got these item transforms in it, and it's got these batch transforms in it. And so what we actually want to do when we create our DataLoaders is say that `after_batch` is whatever the DataBlock says the batch transforms are, and `after_item` is whatever the DataBlock says the item transforms are. Okay, that's ugly. That's something I think we should make easier; so hopefully by the time people see this video, this will all be easier.
So there are some DataLoaders. Okay. My guess is that we've given the wrong number of weights here; I'm guessing this needs to be weights just for the training set. The way I would check this is to type `%debug`, which puts us into the Python debugger. The Python debugger is a very, very cool thing. It's called pdb, and you definitely want to know how to use it. `h` gives you the help, and `w` shows you where in the stack you are. So you can see this is the line of code about to run. And so I can print things out with `p`: `p self.n`, and `p self.weights`. You don't actually normally even need to say `p`, it just assumes it, so I can just say `self.weights.shape`. And so that's the problem: it's expecting 8,326 weights, not 10,407 weights. And that's because, and to be fair, the documentation warned us about this, it's expecting weights just for the training set, not for both the training and validation sets. Okay, no problem.

Could you predetermine the split by adding another column to the same data frame that you put the weights in?

Yeah, I could do that. But actually, and somebody asked about this the other day, this is our training set, and `.items` tells you the file names. So we just need to look each of these up in the data frame. So what we could do is say the weights equal... and we go through each of those, that's all of our files, and look up the image_id. And you know, something you could do here is set the index to image_id, which is this pandas idea. So `wdf =` ... and then we say `.loc['100001.jpeg']`. Ah, there it is. And for `label_y`, there it is. So if we copy that over to here and replace that with `o`... look at that. Okay, so we don't want `sort_values`; we want `set_index`. I should probably make more use of indices in pandas.
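The `set_index` / `.loc` lookup pattern, on a miniature stand-in for the competition frame (assumes pandas; filenames are made up):

```python
import pandas as pd

df = pd.DataFrame({'image_id': ['100001.jpg', '100002.jpg', '100003.jpg'],
                   'label': ['normal', 'blast', 'normal'],
                   'weight': [0.5, 1.0, 0.5]})

# make image_id the index, so each file can be looked up by name with .loc
wdf = df.set_index('image_id')

# the training set's .items come back as file names; look each one up
files = ['100003.jpg', '100001.jpg']
wgts = [wdf.loc[f, 'weight'] for f in files]
```

This gives one weight per training item, in the training set's own order, which is exactly the shape the weighted DataLoader was complaining about.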
I guess I still don't have a great sense in my head of quite how they work, so I tend to underuse them. Okay, so the weights should now be the right length for the training set. Okay, so now our weights here: it's just the weights list. Cool. And what I find encouraging here is that we've got a lot of the bacterial ones showing up in the batch. This seems like a good mix, right? So then we should just be able to pass those to our learner and fine-tune for five epochs. All right. Sorry, that was a bit more awkward than I would have liked, and it definitely used a whole bunch of concepts we haven't covered before. So don't worry if you're feeling lost about the implementation here.

Jeremy, just about how the sampling works: we've got weights, but how is that actually sampled from the training set? Do we have some number of rows or images that we're trying to create a sample of?

Yeah. So what happens is it creates batches, and each batch will have 64 things in it. So it's going to grab 64 images at random, but it's a weighted random sample, where each row is weighted by this weight. And so an epoch is not exactly an epoch anymore, in that it won't necessarily see every image once. An "epoch" just means the number of rows it's seen equals the total number of rows in the dataset. But it'll see a lot of the less common ones multiple times, so there's a definite danger of overfitting. The weighted sampling is not done for the validation set, so we should be able to compare these. Let's take a look. So, 5.6 versus 4.6. Now, this is expected. But where this might be interesting would be to do all of our training and then, at the very end, do a few epochs with weighted training: at the point where it's already really good, just show it a few more examples of the less common ones. Or just train for longer with more data augmentation.
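One caveat raised earlier: because rare classes were oversampled during training, the model will overestimate their probability, so the bias may need undoing at prediction time. A common correction (a sketch of the general idea, not something fastai applies for you) is to divide each predicted probability by its class's sampling weight and renormalize:

```python
def unbias(probs, wgts):
    """Divide each class probability by its sampling weight, then renormalize to sum to 1."""
    adj = [p / w for p, w in zip(probs, wgts)]
    s = sum(adj)
    return [a / s for a in adj]

# the rare class (sampling weight 5) got inflated to 0.5;
# the correction pulls it back below the common class
probs = [0.5, 0.5]    # [rare, common], straight from the model
wgts = [5.0, 1.0]     # sampling weights used during training
corrected = unbias(probs, wgts)
```

Here the adjusted probabilities become 1/6 for the rare class and 5/6 for the common one.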
Yeah, I mean, you would expect the error rate at this point to be worse, because the most common types, which it particularly ought to care about since they're the ones it'll mainly see, it hasn't seen very much of. So the overall error has gone up. But yeah, I think there may well be ways to use this.

Jeremy, could you possibly quickly explain where the deficiency was in this weighted API, and how you would prefer it to look? You said you'd fix it up later.

Oh yeah, sure. I mean, I think the way this ought to look is that I can say `dls = dblock.weighted_dataloaders(...)`, like that, reusing the existing `after_batch` and `after_item` automatically. In fact, you know, we could fix it up now, if you're interested.

Yeah, I'd love to see how you commit a change.

So, the first thing I do before I change the fastai library is make sure I've got the latest version of it, by doing a `git pull`. Because nobody likes conflicts. All right, it's up to date. So then I would go into the notebooks, and it was in `callback.data`. And so here's `weighted_dataloaders`.

Jeremy, is this a bit of a silly question, but is it a callback, or is it kind of like a transform within the actual DataBlock? So if you send weights to a DataBlock, then it just does it. Is it a callback?

No, it's not a callback. It's in a strange place. What it is, actually, is a DataLoader, plus a patch to `Datasets`. So there's something I like very much in fastcore called `patch`, which allows us to add a method with this name to this class. And I want to add something to the `DataBlock` class, like so. But yeah, I think the docstring is correct. And I would then be inclined to just grab this here, copy, and paste it here. Okay, and so this would be calling...
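fastcore's `@patch` reads the type annotation on `self` and attaches the function to that class. A tiny stand-in showing the mechanism (not fastcore's actual implementation, which does quite a bit more; `DataBlock` here is a hypothetical empty class, not fastai's):

```python
def patch(f):
    """Attach f as a method of the class named in its `self` annotation."""
    cls = f.__annotations__['self']
    setattr(cls, f.__name__, f)
    return f

class DataBlock:  # hypothetical stand-in for fastai's DataBlock
    pass

@patch
def weighted_dataloaders(self: DataBlock, src, wgts, bs=64):
    # the real version builds Datasets then delegates; here we just echo the args
    return (src, wgts, bs)

db = DataBlock()
```

After the decorator runs, `db.weighted_dataloaders(...)` works like any method defined in the class body.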
Yeah, so we're calling the DataBlock's own machinery. So I guess we're going to do the two steps: first we go to `.datasets`, and that means we need to be passed the items, the source. Okay, so this thing on DataBlock is going to need a source, it's going to need the weights, it's going to need a batch size. Apparently there's something called `verbose`; I don't know what that does, but that's fine. So the datasets is `self.datasets`, passing in the source, and `verbose=verbose`. And then we called `dss.dataloaders`; and when we did that... okay, so now we're going to be doing `dss.weighted_dataloaders` instead. That's basically it. Oops, what happened there? And then we pass in the weights. So `weighted_dataloaders` gets the weights, then the batch size, then the things we added, and any additional keyword arguments, which delegate down to `Datasets.weighted_dataloaders`; that's where the keyword arguments get passed to.

Okay, so as far as I can tell, these same tests all work. We don't need these labels anymore. It is valid. We've already got a DataBlock. So previously we called `.datasets` and the item transforms and weights manually. So that is our source; we can get rid of all this, and we're now going to go `dblock.weighted_dataloaders`, passing in our source, and our weights, which were called `weights`, and we don't need that anymore. Okay. Why did I get zero? That's slightly surprising to me. Oh, no, zero is fine. You get zero or one, because it depends how... why is it slightly random? I'm not sure; something's slightly random. But anyway, it's working. So then, again, for this one, we shouldn't need to do `.datasets`; we should be able to go `dblock.weighted_dataloaders` and pass in our items and our weights. `dblock.weighted_dataloaders`.
Oh, it's got it. Okay, let's see: our source and our weights. Why doesn't it like that? source. So let's see how it's different to what this one said. Datasets — okay, this one doesn't use a data block. So, okay, I can't replicate that; that's fine. Okay, so that's our test. There we go. So what I would then do is export it. And so that I don't have to rebuild or reinstall my fastai library or anything like that — that's because I have it installed using something called an editable install. If you haven't seen that before: basically, when you go pip install -e . in a Git repo, that creates something like a symlink from your Python library to this folder. So when I import fastai, it's actually going to import it from this folder. And so now, back over here in my weighted thingy, if I re-run this data block, we should find that there's now a dblock.weighted_dataloaders which I can pass source and weights. And my source is files, and my weights is my weights. Okay, so that's interesting: no weights. Yes — we don't have datasets yet. So that's a very interesting point. So how do we know what our weights are? We don't, because they haven't been split. Could you not send them through as one of the blocks, with a column getter, and then use that? Because then it would be linked quite intimately with the actual row. Well, we don't need to. I think what we need to do is pass in weights — all the weights. And then this thing here should be responsible for grabbing the subset for the training set. And that would actually be much more convenient, which after all is what we want. So rather than determining the weights based on the distribution across the classes here, we should split the weights using the splitter into a training and validation set. So then we don't need any of this, and weights will actually simply come from the weighted data frame.
So basically what I would do here is — this will actually... we'll go back to saying this is sort_values. And then our weights will be wdf.label_y — that's actually our weights, as a NumPy array. Silly question: could you not just send a function for weights to the standard data block, and if it doesn't get one, then it does nothing? Potentially we could. I kind of like this though, because... yeah, if weights were all one as a default, then you could use the one solution for both. Yeah, you could. I just find it's a little bit too coupled for me. I don't love it, but it would be doable. It's an unnecessary edge case, I suppose. You know, I like how nicely decoupled this is, so I think this is what I want it to look like. So I would look at how the splitters work. So the splitter... OK, so the splits get created here in Datasets. Cool. And then I wonder if Datasets remembers what those splits are. I don't have tags here — what do you mean, no tags file? OK, there we go: Datasets. So that's Ctrl-] (control plus right square bracket), to remind you, to jump to a symbol in vim. I see, and that's actually mainly happening in this inheritance — the superclass is where this splits stuff is. Here we are: splits. I see there is a splits. So dsets.splits — yeah, so there are the indices of the training and validation sets, and so that's the indices of the training set. So the actual weights we want are those ones. So over here we'd say training weights — we'll change this to the weights of the training set, which is the weights at those indices. And that's what we choose, like so. So... thoughts? Thank you so much. No, dsets.splits — self is a DataBlock, and it's actually the Datasets that has the splits. The DataBlock has a function that knows how to split, but the split doesn't happen until you create the Datasets. That way you can get different random splits each time if you want them. Thank you for checking, though. OK, so I'll export that. And...
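The idea of passing in all the weights and letting the method grab the training subset via the stored splits looks something like this. The numbers are made up, and splits here just mimics the (train, valid) index pair that dsets.splits holds:

```python
import numpy as np

# Weights for every item, e.g. inverse class frequencies (made-up values).
weights = np.array([1.0, 5.0, 1.0, 1.0, 5.0, 1.0])

# (train indices, valid indices), as Datasets stores them in dsets.splits.
splits = ([0, 2, 4], [1, 3, 5])

# Only the training DataLoader samples with weights, so take just
# the weights at the training indices.
trn_weights = weights[splits[0]]
print(trn_weights)  # → [1. 1. 5.]
```

This is why passing all the weights is more convenient for the caller: the split into train and valid happens inside, using the same splitter that built the Datasets.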
It would probably be good to have autoreload going, but we don't, so be it. OK. Now, we did miss a self, but it's not the one you thought of — this one here. Yeah, I guess actually if I just comment this out, then we can just run all above without worrying. Aha, OK, things are happening. So dls equals that. OK, that looks pretty good. OK, so I think we've created our feature. And so then the next thing I would do — it would be very, very weird if any tests broke, but I would go ahead and run the tests. I would then create an issue for my feature. So I've got a bunch of tiny little aliases and functions; one's called enhancement, which creates an issue with the enhancement label. So I'll go: enhancement, "Add DataBlock.weighted_dataloaders". So that creates the issue, as 3706. So if you're interested, you could take a look at that issue. It's not the world's most interesting issue, but there it is. All right, looks like the tests are basically... oh no, we've got an issue. There we go — we've got a test that's failed: "indices must be integers or slices". Right, so I'm glad we checked. OK, so the problem here is that I've sliced into my weights on the assumption that it's something I can slice into, which would only be true if it was a tensor or an array. But in this case my weights are not either of those things. So what do I do to fix that? Question here. Yeah? You only kept the indices of the training and validation data sets — how can you know these are the weights, since you haven't actually done the calculation, the inverse-frequency kind of thing? The weights are being passed in as a parameter: we calculated the weights up here, and we passed them in. What's the incorrect type that's coming through in the test? It's not that it's an incorrect type. It's that — see how here I'm indexing into the weights using my splits? The weights here are a plain Python list, and you can't index into a Python list with a list.
You can only do that with tensors or NumPy arrays, I guess. Yeah. What we actually want to do is check whether it's an array type. Is there an is_listy or something, that function? There is, but that's not quite it. I think we want the opposite: is this the kind of thing that one could expect to be able to do NumPy-style indexing on? And I believe the correct way to do that might be to look for this thing. Yeah. So I would be inclined to say... and there may well already be something in fastai that knows how to check for this, to be honest. Oh — okay, so what's this thing? Oh, that's something that's commented out. All right, so I guess I don't have anything which checks for that, so we'll just do it manually. So: if weights has the dunder __array__ attribute — because I'm pretty sure that tensors have that as well. Yeah, it does. So if it has that attribute, then I think we're good to go; otherwise, we can use a list comprehension. Oh, you know what we could do — yeah, okay, what we'll do is we'll just say: if it doesn't have that... I don't know if it's too rude to change their weights, but I think this is fine. If it's not a NumPy-type array, it's probably going to benefit from being converted to one anyway, right? Yeah. I mean, I don't see a downside. Passes our test. Passes all of our tests. Okay. So that was our only test that failed, which is now passing. So I would now say we've fixed issue 3706. I've got a little fixes function that does that: fixes 3706. Okay. And so now if we look at that issue, you'll see that it's been resolved with this commit. Yeah — but how do you commit from the notebook? Do you sort of have, like, a reset with empty cells, or do you run the cells? I commit them basically however they are, but with unnecessary metadata removed.
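The failure and its fix can be reproduced in isolation: a plain Python list rejects fancy indexing, while anything exposing __array__ (NumPy arrays, and PyTorch tensors too) supports it, so converting anything else is a reasonable fallback. A small sketch, with made-up values:

```python
import numpy as np

weights = [1.0, 2.0, 3.0, 4.0]   # a plain Python list
idxs = [0, 2]                    # e.g. the training-split indices

# A plain list can't be indexed with a list of indices.
try:
    weights[idxs]
except TypeError as e:
    print(e)  # "list indices must be integers or slices", roughly

# The fix: if it doesn't support NumPy-style indexing (no __array__),
# convert it. ndarrays and tensors both define __array__, so they
# pass through untouched.
if not hasattr(weights, '__array__'):
    weights = np.array(weights)

print(weights[idxs])  # → [1. 3.]
```

Checking for __array__ rather than for specific types means the code keeps working for any array-like that plays the NumPy protocol, without importing torch just to do an isinstance check.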
So there's a hook that automatically runs this function, which is the thing that removes stuff like the execution counts, unnecessary notebook metadata, stuff like that. So the idea is that the notebooks want to have all the outputs in place, because they get turned into documentation, and we wouldn't want to run them all in continuous integration to create the documentation, because that can involve things like spending 10 hours training an NLP model, for example. So we don't remove the outputs, for that reason — and also because I want people to be able to look at the notebooks on GitHub and see, you know, all the pictures and stuff. All right, I'd better stop there. Oh, that's interesting — did I...? Okay, I guess I don't have my hook installed. So I'm glad I ran that manually, so you can see exactly what it does: right, it empties out the execution counts and removes the metadata. Sorry, another question — I'm just trying to find it: is that Git hook available in the repo, or...? Yeah, so if you run nbdev_install_git_hooks, it installs the hook. And specifically it's going to... whoops, can't see it. Is that under the nbs folder? No, this is part of nbdev. Oh, okay. Right — so once that package is installed, it's a built-in command in there. And so that then installs a filter here. I'll read more about it, thanks. And it also installs a Git hook to trust the notebooks, which calls nbdev_trust_nbs. Anyway, yeah, that's all in the nbdev docs. And then what's going to happen now on the fastai GitHub side is that it's now busily running all the tests again, like so. And one of the things it checks is that the notebooks are clean and that the export's been run, and then it checks all the notebooks, somewhat in parallel. Yeah. All right, I'd better go. See you all. Thanks. Bye.