So welcome to class today. Today we're gonna be talking about transfer learning from a supervised and an unsupervised perspective. Here today we have William Falcon, an expert in the tools we're gonna be using, who's gonna tell us a little bit more about this topic, okay? So William, the floor is yours. Alfredo, thank you so much for having me here, and I'm excited to share this with the whole class and everyone. So okay, today we're gonna be doing self-supervised and supervised transfer learning. This is gonna come up a lot for people. If you work in industry or do any kind of research, you're going to run into something where you may not have enough data, and you need a model that's been trained on something else. And then you can use that to kind of jumpstart your process, right? So we'll cover it in the context of computer vision today, but this can transfer to NLP, speech, anything you want. Transfer learning. Yeah. Yeah. Okay. Pun intended. It doesn't quite work in some domains yet, but I'm confident that in the next few years we'll figure out a way to do that. So the first thing I'm gonna do is install Lightning, right? And Lightning is a lightweight wrapper for PyTorch for high-performance research. So if you're using PyTorch, it basically organizes your PyTorch code so that you can leverage things like multi-GPU training, TPUs, and different things that require a lot of expertise, and frankly are things you don't need to deal with when you're trying to work and build models. And the second framework that I'm gonna install is Bolts, right? So Bolts is our other framework, and it's like a research toolkit, basically.
So if you've ever wondered, if you're starting something, it's also used in industry as well, but if you're starting a project and you're looking for a model and you don't find one, you can look for it in Bolts, right? And then this will be what may possibly be one of the latest models, but it's already been implemented, tested, and documented, so that you can start from a good spot. So you don't have to sit here and try to implement the baseline yourself for three months to see if you got it right. You can just kind of subclass and build on that. And Bolts has a pretty robust library of self-supervised learning. A lot of it is stuff from my own research as well, and things that we've implemented from the latest papers. Okay, if I would like to learn more about these things, where can I find more resources? Yes, you can go to the Lightning repo, right? So this is, I mean, I think probably the easiest thing is to go to pytorchlightning.ai here. So we have the landing page there, and then we have everything: documentation, videos about Lightning, about the team, and everything else. And then there you can click on docs, right? And then go straight to the docs, get started here with the "Lightning in 2 steps" guide, read this, and at the end of this you should know everything you need to know. Again, if you know PyTorch, this should take like four minutes to understand. Where's Bolts? Yeah, and then Bolts is going to be, let's see if we have it on the page as well. Let's see. Yeah, so we have Bolts here. So you can click on that and then visit Bolts. And then this has its own documentation as well, right? So you can click on it, look at the introduction guide, and then it will walk you through the main ideas about Bolts. So Bolts is a newer project, so it is faster moving, so it will change pretty frequently. Lightning is 1.0, it's stable, so feel free to use it for whatever you want as well. All right.
Okay, so installing this, I believe I'm in a GPU instance. Yeah, the cool thing with Lightning is you can actually use TPUs and GPUs, but I'm going to use GPUs right now. TPUs are dodgy sometimes. Okay, so the first thing I want to cover is: when do you want to fine-tune, right? So I think there are certain things you can ask yourself to understand if what you're about to do is useful for fine-tuning. So let me paste this image here. So I made a little diagram for everyone. Okay, thanks. So let's start with the green spots, right? So I think the first question is: do you have a lot of data? If the answer is yes, then you likely don't need to fine-tune, right? Now, do you also have time and compute? You can have a lot of data, but if it's super costly to train, then you're going to want to find something pre-trained. But if you have the money and the time and the compute, then just go ahead and train on your data, right? And when you're done, run on your test data set. Now, if you don't have a lot of data, then you should try to find a pre-trained model that matches your data distribution, right? So this is super important, because most vision models are trained on ImageNet, right? So if you want to do something like, I don't know, cancer detection or X-rays, that's unlikely to transfer, right? Because in ImageNet, you don't have that kind of data. You also don't have people, right? So you have to be mindful of what you're using this for. It's not like the magic of a neural network is just going to blindly work. And then, so yeah, if you do have something that matches that kind of distribution, then you can use a pre-trained model, right? So when you think about transfer learning, you have two parts. You have the pre-trained model that was trained on something else, and then you have the stuff you're going to add on top of that to transfer it, right? And then you can fine-tune on your own data. So this is very important.
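The decision diagram described above can be sketched as a small function. This is a hypothetical encoding: the argument names and the fallback branch wording are mine, not part of the original diagram.

```python
def transfer_strategy(lots_of_data, time_and_compute, pretrained_match_exists):
    """Hypothetical encoding of the when-to-fine-tune decision diagram."""
    if lots_of_data and time_and_compute:
        # Plenty of data, money, and time: just train on your own data,
        # then evaluate on your test set.
        return "train from scratch"
    if pretrained_match_exists:
        # Little data (or training is too costly): find a pre-trained model
        # whose training distribution matches yours, and fine-tune on top.
        return "fine-tune a pre-trained model"
    # No matching pre-trained model (e.g. X-rays vs. ImageNet):
    # a pre-trained model is unlikely to transfer.
    return "no good transfer option: pre-train yourself or gather data"

print(transfer_strategy(True, True, False))   # train from scratch
print(transfer_strategy(False, False, True))  # fine-tune a pre-trained model
```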
So I would keep this in mind. And then the next things are this, right? You have two major options: supervised or self-supervised. So a model that's pre-trained using supervised learning was likely trained for classification, something on ImageNet, for example. And by doing that, you've introduced bias into that model, right? You've forced it to learn a representation that is going to make classification easier. But there's no guarantee that that's going to make something like segmentation or detection or anything else as easy. So you want to be mindful of that, right? Now self-supervised, like I said, is experimental, but it might be able to generalize better. It was not trained for classification, which means that if you want to do segmentation or object detection, the representations might transfer better. So that's interesting. And then, I don't know if you're following the literature or not, but there are about seven or eight options that you have to do this: AMDIM, MoCo, CPC, PIRL, SimCLR, BYOL, and SwAV, right? And in order, the latest one is SwAV. If you're interested in understanding the differences between all of these, I recently published a paper with Professor Cho a few months ago, and then we wrote this article kind of going into all the details, how to compare all of them, and what the differences are. Because they're really not that different. So I'm going to be using SwAV for this particular case, for our self-supervised transfer learning. So the first case that we're going to look at is going to be supervised transfer learning. This is likely the case that you're going to run into in industry most of the time, right? So you have a small data set of images and some compute budget. And in this case, we're going to pull out a ResNet-50, right?
There are many ResNets: 18, 50, 101, 152. But 50 is kind of like a sweet spot. It's not so big that it's super expensive to train, and it's not so small that it won't do anything interesting for you. Remember, this was pre-trained on ImageNet, and it was pre-trained for classification. And ImageNet is a data set that has a thousand categories, each of them with a thousand images. So it's roughly one million images, right? So that's why it's actually very successful: this huge amount of data allows us to distill a model that has a very good prior in terms of natural images, right? Because those thousand classes are names of natural objects. So that's why it usually works quite well. But nevertheless, as you pointed out before, if you want to use the network for medical images, where the statistics of your images are completely different than what they are in ImageNet, then it's completely hopeless to expect any kind of decent result. Nevertheless, let's start with what is commonly done for normal images, right? Yeah, so to start, we're using this library called torchvision, right? By the Torch team as well. And in there, we have a bunch of pre-trained models. So this one's a ResNet-50. I'm gonna set pretrained to True so I can load that model, which is going to download, I assume, some weights. Great, okay, so there are the weights. And then now we can use this to run predictions, right? So let's pretend that the data set we actually care about has 10 classes, right? And those classes are like frog, horse, whatever. So I'm cheating, because there's this data set called CIFAR-10, right? So CIFAR-10, and this data set looks like this, right? Tiny images of 32 by 32 pixels, with three color channels. But you know, it's a useful toy data set.
It's better than MNIST, especially because we're using a pre-trained ImageNet model, and I'm pretty sure that on ImageNet there are not a lot of handwritten digits. So that wouldn't transfer super well, but there are dogs and cats and birds and all that stuff. So same kind of domain, sorry, same kind of categories. Yes, great, so let's use that guy. So we're gonna set up our data set, right? So let's pull in CIFAR-10, again from torchvision. Yep, and then we are going to use these transforms, right? So something that's normally useful for these cases is to normalize your images. So we're gonna make them zero mean and unit standard deviation, right? So we're gonna add this on here. And I'm not gonna add it right now, because I wanna actually plot the image for you, so you can see what it looks like. So I do this, and then that will actually download. I need to import transforms, right? Here: from torchvision import transforms. Okay, so this is going to download, extract, and then we have this data set, great. So that's ready to go. So let me just plot it now. So I'm just gonna copy-paste some matplotlib code to do that. And I'm gonna also plot the label, to show you what the label is. And there you go. So you can't really tell, but that's label six. It's a super nice frog, I can't tell. It's beautiful. It's got like a red and orange eye. But if you look at label six, I mean, let's just verify, right? So zero, one, two, three, four, five, six. So yeah, it looks like this guy, kind of. Yeah, so that's a frog. So let's normalize it now, so that I can feed it to the neural network. Oh, it's already downloaded, okay. Oh wow. So that's great. And cool. So now, we don't wanna iterate through these images one at a time, right? So see here, this is a single image. We actually wanna do batches of images, right? So we're gonna use a DataLoader for that. So I pull that in, I'm gonna say batch size 32, and I do wanna shuffle. So there it is.
And then obviously, to iterate through this, it's just a simple for loop, right? So for batch in data loader: get the batch, expand it out, print the shapes just so we can see what they look like. So 32 is the batch size, which is great; three channels; 32 height and 32 width pixels, right? And then our labels are 32, so just 32 scalars. And then now I also need to modify my ResNet. So if you remember, the ResNet was trained for ImageNet, and as you said, there are 1,000 classes there. So it won't really work when I have a data set of 10 classes. So for that, I need to modify my ResNet-50 to handle that. The thing is that the ResNet-50 is actually the convolutional backbone plus this fc, fully connected, layer at the end, right? So we're gonna replace that last fully connected layer. I don't know what the output size of that ResNet-50 layer is, but I know that I need to have 10 as an output, because that's gonna be the number of classes that I have, right? So what's happening under the hood is you have this ResNet, right, a bunch of layers here. And then at some point when that ends, you have this fully connected layer, which is mapping the output from here into whatever number of classes you have, right? So this guy I wanna replace. You know, depending on the context, you sometimes can drop it, put an identity function, whatever you want. Okay, so we'll replace that. This is to make sure that it works. Okay, perfect. So now we're good to go. So let's go ahead and predict some stuff, right? So I'm just going to load another batch here, and then I'm gonna run it through my ResNet. So this image, and then I'm gonna look at the first item's predictions only, right? Great. So that looks good. And I notice that the highest number is this first guy here, right? 0.7, it looks like. So let me do a softmax, just to kind of turn this into probabilities, right? Great. So 0.19, that looks like the highest one.
So this is what the network would predict for this particular image, the label. And we're gonna use the argmax to pull the label name, I guess, in this case. So we get zero, right? So that makes sense, because that was the highest number as well, right? And there are 10 here. Where do we feed the image to the network? Oh, up here, right there. But X was 32 by 32, right? And the ResNet is usually trained on 224-by-something pixels, right? So those are more. Yeah, correct. It's convolutional, though, so the input size doesn't really matter, as long as you don't shrink it too much, right? So I'm cheating a little bit, but CIFAR-10 is fine; 32 won't disappear. If you did something like MNIST, potentially you could have a crash, because at some point you'd downsample way too much. Uh-huh, okay. So yeah, that's a beautiful part about ResNets, sorry, about convolutional networks: the input size can change as well. Okay, so let's print the labels. So I pulled the labels out, and they don't match, right? So obviously you're like: well, this is supposed to be this fancy pre-trained ImageNet model, but it didn't predict the labels. My labels are seven; I predicted zero. None of these are great, except this one, by chance. So one out of 10 makes sense, right? That's the expected number given ten classes. Yeah, so they don't match. So the reason for that is because, remember, we replaced this layer here, right? Where is it? Yeah, here. Nope, not there. Here. So we replaced this layer. So if you replace anything in this pre-trained model, it's not going to work, right? And the last layer we made is completely random. So now I have to fine-tune this thing. So the process of fine-tuning, you know, you can do it two ways. You can take this pre-trained network and keep it kind of frozen, right? So you're never going to backpropagate into it, and then strap something else on top of that.
I'm going to use a linear layer, but you could use an SVM, you could use logistic regression, you could use a random forest. Any classifier, it doesn't matter, because what you're doing is using the neural network to extract features, and then using some other classifier to take those features and classify, right? And if the network did its job, they should be linearly separable, so an SVM would be fine at that point. So in this case, I'm going to be lazy and just use a linear layer, right? So I'm going to actually separate them to make this more clear. So here's the backbone, okay? I'm not going to mess with anything about the fc layer. I'm just going to create a completely separate layer now. So I'm going to call this the fine-tune layer. And I'm just going to take whatever the output features of that backbone fc were, and then map them back to 10. So I'm just adding another layer now. Now, in Bolts, well, in Lightning, we have this concept of data modules, right? So notice I only have a train split for CIFAR-10, but I actually want a validation and test split too. And, you know, it's just going to be super annoying to do it myself and split them up. So I'm just going to use this data module that we have, right? So the data module is literally just going to be three data loaders: a train, a val, and a test data loader. And it has within it all the splits and everything you need to care about. So let me show you what I mean by that. So I'll go to the documentation. I'll go to the vision data modules, supervised learning, CIFAR-10. So they're here, and I can look at the source. So here's the source code. And now I see that, you know, there's a bunch of preparation, and all the boilerplate stuff is taken care of for you. So I'm just going to get a data loader with the train split, a data loader with the validation split, and a data loader with the test split.
And I don't have to deal with any of this other magic; it's just a PyTorch data loader. So I'm going to use this guy, because I want to save us all a bunch of time. Okay. So let me just code this up in plain PyTorch real quick, just to show the basics of this, right? So I'm going to import an optimizer and then our loss function here. And I'm going to use Adam, but I don't want to update this backbone, because as soon as I start updating it, it's going to lose its representation. Yeah, yeah, okay. You just want to fine-tune, just train the classifier. Yeah, just the classifier right now. So I'm going to do that. And then I'm going to iterate; you know, I'm going to set the number of epochs here. And then again, this is the data loader; I can just use the data loaders directly, right? You can use them in plain PyTorch. So I'm just going to pull the train data loader out and then iterate through it. And then I want to run the input through the backbone. I'm going to get a bunch of features, right? So I'm going from batch by 3 by 32 by 32, so channels, height, and width pixels, to batch by 1000, which is the number of classes on ImageNet, but we're going to treat this as the embedding dimension here. And you know, I don't have to do this, because my optimizer is only looking at this fine-tune layer, but I'm going to detach here, right? So I could just detach it here; it doesn't matter. I'm going to comment this out for now, but I just want to show that if you were training both at the same time, you could just detach the features and pass them into the classifier. Yeah, but actually I would recommend you to do, if you go a couple of lines above, if you click under the x, below the "x, y = batch", just below. Yeah, one line below. Okay, there we can, one line. Yeah, we can write "with torch.no_grad". Oh, sure, there you go. So I would prefer this, actually. no_grad, parentheses, and colon.
Yeah, and then we indent that feature extraction. So this is basically telling PyTorch not to track the computational graph, right? So it's going to be much faster, and also, hold on, I'm not sure it's going to be faster, but it's not going to take any additional memory, because we are not going to be doing backpropagation through the backbone, right? So this is usually, this is the best practice, I would say. Right, yeah, makes sense. I know, again, I'm only optimizing this guy, right? Yeah, yeah, I know. But yeah, to your point, you won't keep a graph when you do this, which is definitely more efficient. Okay, so then, where's my screen? Okay, so we have the features, and then we're going to run them through our fine-tune layer, right? So this guy here. So again, this fine-tune layer could be, you know, an SVM if I wanted, right? It doesn't matter. But here I'm using this linear layer that I created up here. Okay. So I'm going to run it through the fine-tune layer, and it's going to give me the predictions. I guess it does have to be differentiable in this case. But that's okay. And then I'm going to calculate the loss, right? So cross entropy. So I'm going to pass in the predictions and then the labels. There's a space missing, hold on. Yeah, there we go. I like that we both tune in immediately to be like: why is that off? And then, you know, the standard backward and optimizer step stuff, optimization stuff, right? So we'll do that. Cool. And then, you know, I'm just going to print the loss, just so that we know where we are. Okay. So I'm going to run this guy, and hopefully you'll start seeing the loss coming down. Let's see. There it is. All right. Something's happening. The magic is happening. Okay. So I'm going to stop this, because it's super slow. Weren't we on a GPU machine? So I kind of want to use this GPU. So I'm going to just convert this into Lightning real quick.
And that process is going to be super fast, because I'm just organizing what I did, right? So I have all the same stuff there. So the first thing I'm going to do is just create this LightningModule, right? So this class here. And pl, we imported it at the top, right? So it's just Lightning. Did we import it? No, we didn't import it, we just installed it. Yeah, so let me just import it then. So I import pytorch_lightning as pl. Great. So this is basically the same as an nn.Module, right? But it's just Lightning. So, and then I'm going to change things a bit, right? So I need to call this first. Oh, sorry. So I need to write our init function. Yeah. I've written this so much by now. OK, so then I'm going to init our superclass. Oops, there you go. Great. And then I want to use the backbone, right? So I'm just going to bring that guy in. What was it? Here. So, you know, I'm just going to define it in the model. Here, self.backbone. And then I want to also use this linear classifier guy. Here. OK, so here's our fine-tune layer. Great. Same thing. Perfect. OK. And then I want to parameterize this a bit, right? So this 10, I can make this more general. This is an image classifier, so I'll just say num_classes, right? And we'll default to 10, because it's going to be for CIFAR-10, I'll say, but I can just do that. And then so that's it. And then now the training step. So this is what's going to abstract away the training loop, so that, you know, we don't have to deal with all this boilerplate code. So is this training step like the forward in a module? What is it? Yeah, it's kind of like a forward, but instead of just doing a forward pass of computation, it lets you define all the interactions of the model, the loss, and everything else, right? So it's capturing a full system. So if you were to do BERT or a GAN or a VAE or something like that, all of that logic would happen in the training step.
So it's easy to understand and keep together. So what does this method have to return? So you have to return a loss, right? Oh, OK. Yeah, and I'll show that in a minute, but it has to have a computational graph attached to it, so it can do the optimization. So where do we have all that? It's literally just all of this stuff here, right? So I'm just going to copy. OK, so the training loop that we wrote before in PyTorch goes inside the model now? Yeah, exactly. It's a little bit interesting, but what it does is it makes your model self-contained, right? So the plain version is not reusable; like, I can't reuse this code for a different task tomorrow. It's very specific to this thing. So I'm just pulling out the relevant stuff, which is what we're going to spend 99% of our time on modifying. So we're going to keep a few things there. You just selected half. So what about the last part? Yeah, so that you don't actually need, right? So Lightning is going to do this for you automatically: it's going to call the backward, the step, and the zero_grad. There's a way that you can do it yourself if you want to. But we're going to use something called automatic optimization, which is enabled by default in Lightning, where it does it for you. You can turn that off and then just call it yourself, and that's fine. OK, OK. So otherwise, we just have to specify basically what the loss is, given a batch. Is that correct? Correct. Yeah. OK. So let's do that. So I pasted all that. So nothing changed. I'm going to remove this thing here, because, I mean, you know, you can leave it if you want, I guess. Doesn't matter. No, no, leave it. Why do you want to remove it? Yeah, well, because I want to change it eventually so that we can actually fine-tune the backbone. But we'll leave it for now; we'll make that change later. OK, so this is self.backbone now, right?
Yep, self.backbone, and then this is self.finetune_layer. Cool. That should pretty much have all the same stuff. And then I return this loss here. Great. I see. OK. And then the last thing is, so a LightningModule is like a recipe for a model, right? Or like a task, or whatever you're trying to do. Here it's a classifier. It's not a model, right? It's just a general classifier. But the last ingredient I need is the optimizer, right? So I need to know what optimizer I'm going to use. Yes, yeah, return that. And there's a method called configure_optimizers where I specify that. So these names are, I believe, reserved names, training_step and configure_optimizers. Where do I find them? How do I know what these names are? Yeah, so those are in the documentation, right? So when you go through the "Lightning in 2 steps" guide, it will walk you through and say: this is what you need to define, training_step, configure_optimizers. OK, OK. Yeah, and forward is optional. I'm not using it, but you don't have to use it, because we're not using this model for predictions. So I don't have to actually define it. So if we were to use this model for prediction, we would actually use the forward? Yeah, correct. So in this demo, I'm using an autoencoder. So in that particular case, I wrote the autoencoder to generate embeddings when you use it, so I wrote up the forward to do that. But the training step is separate. OK, so we have the init, which basically defines all the modules inside. Then we have a forward, which, as you said, is used only whenever you want to use the model for predictions, but we don't necessarily need it. Then there is the training step, where we define how the loss is computed, given a batch and a batch index. And then finally, we have this configure_optimizers, which specifies the optimizer we're going to be using for adapting the parameters of the network. Is that correct?
Yeah, that's exactly correct. All right, OK, OK. Makes sense, makes sense. So I'm going to use that optimizer, right? So Adam, we have it up here. OK, so, you know, when we did it up here, we were just passing in that fine-tune layer, right? You don't have to do that. So I'm just going to call self.parameters() here, and that's going to pass in all of the parameters, right? And that's OK, because the backbone is disabled, because of the no_grad thing. So it's going to be fine; I'm not actually going to backpropagate into it. And then I'm going to set this learning rate, like this. And then I need to return this optimizer. Don't forget this, or you'll get random noise. OK, so that's literally it. So again, this is the same code, but now it's vastly less boilerplate, even for such a simple project, right? And then now, to train this, very simple. So I'm just going to init my model, right? Oh, actually, I forgot one thing. So this learning rate, right? I kind of want to tune it. So let me just make it a parameter, right? And then let's do that here, and then I'll pass it in: self.lr equals that. Great. So I think there should be some very nice trick, I believe, in Lightning, so that I don't have to type this. Yes. Can you tell me? So we have save_hyperparameters, I believe. Hyperparameters. Yeah. OK, we don't actually have to. Now, when I do this, I don't actually have to go and say self.num_classes equals this, self.lr equals whatever, right? I don't have to do this anymore; I can just add this call directly. OK. So it's saved under this thing called hparams, hyperparams, right? OK. It's actually there directly. I see. And this is useful, because in most models you have like 30 parameters, right? So you don't want to do this manually. OK, great. Now, to train this thing, it's very simple: I'm just going to init that model, right? So how do we train this stuff? Yeah, so remember all the stuff that we got rid of?
So the epochs, batches, optimizer, whatever: that's all inside this Trainer, which basically handles all the engineering for you. Which trainer? So it's here, right? So this Trainer is here. Oh, I see. We haven't explained that yet. Yeah. So in the Lightning library, there are only two things you need to know about: this Trainer and then this LightningModule. So the Trainer is literally your basic training loop, right? So let me just show you. So the LightningModule explains that, and then the Trainer is here: Lightning API, the main APIs. Oh, I see, I see, I see. OK, so there are just two there. Yeah, that's it. OK. And then I think we even have a pseudo-loop, so you can understand what it's doing. Let me see. There it is. OK, but I think I got it. So I just go to the docs, and then there is the API, and then there are the two things that you mentioned before: there is the model, and then there is this Trainer, which is training the model, which makes sense, I think. Yep. And then the LightningModule tells you the same thing, right? And actually, I think it is the LightningModule page that walks you through it. Yeah, so here you go. So it's showing here, under the hood, what Lightning is going to do. It's the same thing that we wrote here. OK, OK, OK. That was my question, actually. So this is exactly the thing that we wrote in the notebook, right? So we have the model, and then you go back to the documentation. All right, so the other one. Yeah, so we put the model in training mode, then we enable the gradients, and then there is the saving of the output from the training step, which is going to be the loss. Then OK, we compute the backward pass, and then we step, and then the zero_grad. OK, yeah. And as you see, oh, there's a bug here, this should just say batch here, right? You're just passing the batch directly to the training step. OK, so we're back here.
So that was training_step, configure_optimizers. So we're good to go; the Trainer is going to do the rest. Now, Colab tends to freeze, because the progress update happens very fast. So when you start training, Lightning is going to print a little progress bar, and it's going to overwhelm the screen. So we want to slow that down. So let's change that refresh rate to 10, actually 20, so that we're not freezing the screen. And then we also need the model, the classifier, right, that we just wrote. So here you go, this guy. And then we're going to, I like to swap these, so I'm going to add them here. And then I'm just going to train. Is that a class? Yeah, so right now it's a class, Trainer, so I'm going to create an instance of it. I'm going to call it, and then I'm going to pass in my classifier here. Uh huh. And then we have the data. So as I mentioned, I have just this train loader. So I'll show you first with the train loader: I can pass that in, it's not a problem. So I'll pass in the regular PyTorch data loader, and then this will just start training, right? Oh, wow. It's also printing: GPU available: True, used: False. Yeah. And actually we give you a warning; we're saying: hey, you have a GPU, but you're not using it, right? So we have a good experience there. Anyway, so you see the thing is training. Now I'm going to use that data module, right, that we created, instead. So I just pass that in, and it's going to have the same effect, right? So you don't have to deal with it; it knows to pull out the training split. Oh, OK. The training splits. OK, so it's going. Let's give it a few seconds. So also here: TPU available. You can also use TPUs, you said before, right? Yeah, correct. Wow. OK, look here. I'm just going to use the GPU though. So we have the GPU, and I'm not going to change my code; I'm just going to set gpus equal to one here. And now we're training on a GPU, and you'll see that it's much faster now. Oh, wow.
So can we also try a TPU, just for the sake of curiosity? We can; it's just that you have to install this XLA library. Oh, OK. Then next time, sure. So the caveat with TPUs is that Google is working super hard with the PyTorch team to support TPUs, but the experience is not quite there yet. You still have to install this XLA library, and if you go to the Lightning docs, they'll show you how. Once you do that, you just change this to, I think it's tpu_cores, set it to one or eight or whatever you want, and then you'll be training on the TPU. OK. There are plenty of demos on the website that show that. OK, just to be sure. Yeah, I just want to keep this one focused on the transfer learning. OK, fine. So while we were talking it trained; you can see the loss is at 1.95. Oh, wow, already quite low. So let's just see what happened, let's see how it has learned so far. Lightning creates logs for you automatically, and I can just launch TensorBoard to visualize them. Oh, wow, really impressive; you can do anything. And here we are, TensorBoard shows up. And now you see: oh, we didn't log anything. Yeah, so we have to log something. OK, how do you do that? So let's just log this loss, right? I'm going to say self.log, and then I say train_loss and pass in the loss. I also want to log the accuracy, so let me just pull out our fancy metrics library: from pytorch_lightning.metrics.functional import accuracy, right? And then we're going to also log the accuracy with self.log. So self.log is a method in the model? Yeah, it's a method of the LightningModule. What's cool is that you're training on one GPU right now, but when you start training on 8, 200, whatever, you have to sync logs across GPUs, and you have to calculate metrics correctly. You can average accuracy, but you can't average something like ROC.
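The `accuracy` functional being imported here boils down to comparing argmax predictions against the targets. A plain-PyTorch equivalent, as a self-contained sketch (this is not the library's own code):

```python
import torch

def accuracy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # fraction of rows whose highest-scoring class matches the target
    return (logits.argmax(dim=-1) == target).float().mean()

logits = torch.tensor([[2.0, 0.1],
                       [0.2, 1.5],
                       [3.0, 0.0],
                       [0.1, 0.9]])
target = torch.tensor([0, 1, 1, 1])
acc = accuracy(logits, target)  # 3 of 4 predictions correct -> 0.75
```

Inside `training_step` you would then call `self.log("train_acc", acc)` right next to the loss.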
So Lightning handles all that distributed syncing for you, and when to do it, if you use self.log. I think you made a mistake there, though. This is a very tricky mistake everyone makes: you didn't detach that loss, right? So it looks like you're saving the whole computational graph in your log there. Oh, no. So when you log it, Lightning will detach everything for you. Oh, wow. Really? Yeah, you don't have to worry about it. Nice. OK. We try to make sure we keep people from making mistakes. OK, so let's log. We're logging the training accuracy as well now, and the training loss. Yeah. So we have this warning here: we're not validating right now. Lightning is saying, hey, you passed in a val dataloader, but you haven't implemented a validation_step, right? We only have a training loop. And it's saying that because the DataModule has a validation split attached to it, right? But that's OK, I don't want to validate right now; I'm just going to train to see what's happening with this fine-tuning thing. OK. Otherwise, if you were using the previous version, the one with the PyTorch DataLoader, you had to pass both the training and validation loaders separately. Yeah, so you'd have to do this and then pass in a val loader. Oh, I see. OK, so you have to send them separately, but in this case you send one class which contains both. I see, OK, makes sense. OK, so what is that, an epoch and a half? Great, let's see what happens. Let's reload this guy. Oh, did it reload already for me? Nice, maybe I don't have to reload it. Let's see if TensorBoard did its thing today. It didn't. OK, so I'm going to have to reload it. I'm going to zoom out for a second because this is huge. All right, so epoch... oh, wow. OK, one epoch and a half, fine. Let's look at our train accuracy. It's all over the place because it's training, right? So you mostly want to be tracking your validation metrics, like the epoch-level accuracy.
But we know it's high. Without doing anything, we're already at 28%, which is great; it shows the transfer learning is working. And our loss should be going down. It's kind of bumpy, but that's expected, again, because this is training. So now we're seeing the logs, which is great. Now that we have this stuff here, I actually want to unfreeze this ResNet after a few epochs, right? So we're going to fine-tune as it is right now, but after, let's say, 10 epochs I'm going to say, hey, unfreeze the backbone and start training the backbone as well. Yeah, with a different learning rate; usually I do that. Yeah, so you can adjust that as well, right? I'm going to skip the learning rate adjustment, but let me start by changing the epoch condition, right? So the LightningModule, no, the trainer, yeah: the LightningModule has a pointer to the Trainer, and the Trainer knows what the current epoch is, right? OK. As long as the epoch is less than 10, I'm going to do this stuff here, with no gradients. So hold on, how does the network know about the Trainer? Yeah, so when you start this training process here, in fit, the network gets assigned the trainer, and then it knows what epoch it's in. Yeah. OK, OK, that's cool. So in real life you would wait 10 or 20 epochs, but right now, since we're limited, I'm going to put one epoch. Let's change it here. You see how long this epoch is taking right now? Because I'm going through about 1,500 batches. Let me actually limit the number of training batches so we can go through more epochs faster. I'm going to go through 50 training batches, that's it. And then actually I can do this realistically, so I can say 10 epochs: as long as the epoch is under 10, I'm going to do this part, where there are no gradients. As soon as you're out of that, though... wait, I need another statement, sorry.
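The epoch-gated freezing being written here can be sketched as a plain module. The `FineTuner` name is hypothetical, and `current_epoch` is passed in explicitly where a LightningModule would read it from `self.trainer.current_epoch`; the tiny linear backbone and head are placeholders for the ResNet and classifier.

```python
import torch
from torch import nn

class FineTuner(nn.Module):
    def __init__(self, backbone: nn.Module, head: nn.Module, unfreeze_epoch: int = 10):
        super().__init__()
        self.backbone, self.head = backbone, head
        self.unfreeze_epoch = unfreeze_epoch

    def forward(self, x: torch.Tensor, current_epoch: int) -> torch.Tensor:
        if current_epoch < self.unfreeze_epoch:
            with torch.no_grad():        # frozen phase: no gradients flow into the backbone
                feats = self.backbone(x)
        else:
            feats = self.backbone(x)     # unfrozen: backprop through the backbone too
        return self.head(feats)

model = FineTuner(nn.Linear(4, 8), nn.Linear(8, 2))
out = model(torch.randn(2, 4), current_epoch=0)  # backbone stays frozen here
```

In the frozen phase only the head receives gradients; once `current_epoch` reaches the threshold, the whole network trains.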
Here, as soon as you're out of that, I'm just going to actually pull out the features, right? OK. So now I'm going to be backpropagating into this thing. Great. OK, so I just made some changes, and I'm not quite sure if this is going to break. So I'm going to use a quick debugging trick called fast_dev_run. I'm going to turn this on real quick, and when I enable it, it's going to hit every single line of code. It's not going to train; it's just going to do one batch very quickly, just to make sure that I have no bugs. So let me just run that real quick, and if this completes, I have no problems. OK, great, no bugs. If I had a bug here, if I had put, say, assert False, right? Yeah. Then it would catch it without having to train the whole time. Great. Oh, OK, I see. Real live debugging here. OK, perfect. So I'm going to disable this thing now. And I don't need 50 batches; I was just being ambitious. Let's use 20, that's fine, and then this should power through the epochs a lot faster. OK, great. So that was one epoch; you see it's going super quick. And it looks kind of weird because the refresh rate is 20, so it's only going to update every 20 batches. Let me just change that to five, so every five batches it's going to update this bar now. OK, we have 10 minutes left, and we still have to cover the unsupervised learning, just to let you know. Got it, OK. OK, so here, epoch 2, epoch 3, epoch 4. You see the loss is going down nicely. Let's just keep it going for a minute, and then it's going to unfreeze at some point, and then the loss is going to drop a lot more. Now, as you mentioned, you should lower the learning rate as well. I'm not doing that, so I'm not going to get the best performance out of this, but you can do that as well. How many epochs does this train for? Oh, by default, 1,000, right? You can set a max limit or something in the trainer initialization. Yeah, here.
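The fast_dev_run trick just described is a single Trainer flag; here is a config sketch of the Lightning API as used in this session (`classifier` and `train_loader` are the objects from the notebook above).

```python
import pytorch_lightning as pl

# Runs one batch of train (and val, if defined) end to end, then stops:
# every line of the loop gets executed, so bugs surface without a full run.
trainer = pl.Trainer(fast_dev_run=True)
trainer.fit(classifier, train_loader)
```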
So max_epochs, I think it's called. OK, it's going. Yeah, it looks like that didn't work, because this loss is high now. Maybe it starts working now. So yeah, had I adjusted the learning rate, that wouldn't have happened; it wouldn't have jumped in loss, it would have just been fine. But I think it's adjusting now, so let's see what happens. Oh, it's also 50 batches, so that will explain some things as well. OK, there you go. Now we're under 2, which is great; we hadn't seen that before, and we're dropping faster. So let's just show the loss chart so we can see what the effects were. Luckily for self-supervised learning, not much changes: you only change the backbone, so this should take a few minutes. OK, so the last two runs we cared about were these guys, right? This was, sorry, this was the frozen backbone, and this was the unfrozen backbone. So let's see what happened. Yeah, you can see the training accuracy at some point started going higher once the unfreezing happened. It also trained a lot longer, for sure, and we didn't set a seed, so it could be the random init as well. And then the train loss as well: it kind of starts going down, and this jump up here is because of the unfreezing part, right? That's when that happened. Actually, I think it's around here. But it was already lower at this point, obviously, because it trained longer. Unfreezing tends to give you better performance over time, but I think it's really specific to your task. So let's use self-supervised learning instead. All right, so let's move on to this last part. Yeah, so instead of using this ResNet-50, I'm just going to use the SwAV model, right? All I'm going to do now is look at Bolts and load those models, right? So here is the path: you just have to give it a weights path, and you can find that in the docs, right? So it's just Bolts. Show me, show me, what is this SwAV thingy?
Yeah, so SwAV is one of the latest methods coming out of FAIR. So in Bolts, I'm going to look at self-supervised learning. Oh, OK. I have all the contrastive methods here. I'm going to use SwAV, and I have the ImageNet baseline here, and I know I can load it through here, so I can just copy-paste this, right? Is there also a link to the paper somewhere here? I believe so, let's see. Yeah, there it is. Oh, OK, I see. And then, you know, it's adapted from the official implementation; we actually worked with Mathilde on this as well, so that was super helpful. And yeah, so I'm just going to copy this, right? So I can do this. Sorry, apparently my mouse doesn't work. OK, so that's all I need to do. So I'm going to bring that guy in, and I don't need to freeze anything. So I got SwAV here, great. It's going to load the checkpoint for that. And SwAV has this model inside it, which is the backbone, right? So I just need to pull that out: swav.model is going to be the backbone of SwAV. SwAV in particular outputs 3,000 features, right? So I'm just going to change that to 3,000. And in this particular case, this model also outputs two feature maps, and I just want the last one; I don't need the other. This is a pre-trained model, is it? Yeah, it's a pre-trained model, correct. But it's pre-trained on images without labels, is that correct? Yes, it's self-supervised: it was pre-trained on ImageNet without any labels whatsoever. So hold on. We are doing transfer learning, and before we were doing transfer learning from supervised learning, using a network trained on ImageNet. Right now we're still doing transfer learning, but using a network that has been trained on, we don't know what, but without labels. Correct. So this is nice, because I can pre-train my SwAV model on my own data, for which I don't have many labels, for example.
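The shape plumbing being described, a backbone that returns several feature maps, of which we keep only the last, 3,000-wide one, and a classifier sized to match, can be sketched like this. The `MultiScaleBackbone` stand-in is hypothetical, not SwAV itself; only the shapes mirror what's described above.

```python
import torch
from torch import nn

class MultiScaleBackbone(nn.Module):
    """Stand-in for a backbone that, like the SwAV model here, returns two feature maps."""
    def __init__(self):
        super().__init__()
        self.small = nn.Linear(32, 512)
        self.large = nn.Linear(32, 3000)

    def forward(self, x):
        return [self.small(x), self.large(x)]   # two maps; we only want the last

backbone = MultiScaleBackbone()
head = nn.Linear(3000, 10)                      # classifier sized to the 3000-dim output

x = torch.randn(2, 32)
feats = backbone(x)[-1]                         # keep only the last feature map
logits = head(feats)
```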
And then I can just train the classifier with the few labels I have, right? Yeah, that's a good point. If you're working at a company, most likely you're going to have your own data set that might be massive but would cost a lot of money to label. So you don't need to worry about that so much; maybe until this stuff works really well you might still have to, but you can try this out: pre-train self-supervised on that data, and then use it. So yeah, that makes sense. OK. OK, so we get the backbone, 3,000 features. Again, this particular model is going to output two feature maps; it's just the way it's trained, so I just want the last one, not both, right? Let me just make that clearer: features equals features[-1], because this is actually an array. Actually, let's call them feature_1 and feature_2, right? So I just want the second one here. OK. And then I'm going to do the same thing here for when it's unfrozen. Great. I believe those are all the changes I need to make to get this started. That is it; that is the extent of the modifications for self-supervised learning. I told you it would be under 10 minutes. I'm impressed. I've never actually done this myself, so let's see. I don't have to change anything now; let's see if this works. Look at that. Wow. No way. So if we let it run a little bit, we can also compare performance, right? Yeah, it's really much lower; it's at 2 already. The other one didn't even get this low before, and it's not even unfrozen yet, right? So yeah, I don't know, it's interesting. Wow, OK. Oh, that's impressive. So let's wait a little bit and then we can compare, I guess. It's below 2 already, and you know, the other one didn't get below 2 until we unfroze the model, right? Now we're about to unfreeze it, because we're past epoch 10. So let's see what happens. So now we drop. And this one didn't even spike; it didn't go up to 3-point-something, it just kept going down. So it's at 1.4 now.
It's working much better. Like, I don't know why that one didn't spike. Obviously this is an open area of research, and most of us in the lab are working on this as well, with Yann here. I think the spike might come from the fact that the learning rate at the end of the frozen phase had settled to something very small, and now that we suddenly take larger steps, maybe we escape from the kind of minimum we were in, right? Perhaps; it's a good theory. I mean, I'm just guessing; I don't know if I'm right. Yeah. I mean, there's definitely an aspect of that; I think that's probably the most reasonable explanation. I want to say that it's about self-supervised learning, but I can't say that with full certainty. All right, so let's compare the performance of these last two; we have to refresh the TensorBoard, I guess, if it comes back to life. There we go. All right, so the last three are the ones we care about. Just the last two, I think, is enough: seven and eight. Yeah, OK, fine, because the sixth was not trained, I think. This is supervised, and this is unsupervised, so blue is unsupervised. So let's see. Can we actually change the names of these version two, three, four, so we can put something more descriptive? Yeah, you can; the docs explain how to do that. OK. OK, so train accuracy: the unsupervised one wins. Wow, OK. And then the train loss: there you go. OK, OK, you convinced me. All right. So we have seen how to perform transfer learning with a pre-trained backbone, first with a supervised pre-trained backbone, then with a self-supervised pre-trained backbone. But what is the advantage of using this transfer learning? So generally, we're going to be able to get a jumpstart on our training, right?
So first, we're sometimes going to be able to converge a lot faster, but second, we'll be able to generalize on the data set that we're training on. Remember, we had this data set that the model was not pre-trained on, and we're hoping that we can generalize to our test split of that data set, right? So far, we only trained on the training split. That's great just to make sure things are working, but you want to know if you're going to do well when you see actual data, for which we're going to use the validation split. All right, so let's see how we can validate our networks. Yeah, so let's do a quick recap, right? This is the supervised model, using our standard ResNet-50 backbone, and then below I have the self-supervised model, where we just swapped that backbone for the SwAV model. So let's go ahead and train the supervised one, and we're going to do it for 20 epochs, and we're not going to limit the number of batches; we're going to train on the full thing so we can see what happens, right? So I'm going to set max_epochs equal to 20. And since we want to do validation, I actually need a validation loop as well, right? The simplest way to do this is just to copy-paste this code, because it's largely the same, and then rename it validation_step and replace these words with val. And these are also keywords, right? For PyTorch Lightning, this validation_step. Exactly. OK. And one thing to notice here, I want to make a point: in training, you want to log something every batch, right? In validation, it doesn't really make sense to log every batch, because the batches are independent; you want to calculate the accuracy or the loss across a whole epoch. But you don't have to deal with that: as long as you just use self.log, Lightning knows to do it the correct way, so it will aggregate across the epoch for you. Now, this stuff here, we don't really need.
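The point about epoch-level aggregation is worth a tiny numeric check: naively averaging per-batch accuracies is only correct when every batch has the same size, which is one reason it's convenient that `self.log` handles the aggregation for you. The batch counts below are hypothetical.

```python
# Three hypothetical validation batches: (correct predictions, batch size).
correct = [8, 9, 1]
sizes = [10, 10, 2]

per_batch = [c / n for c, n in zip(correct, sizes)]   # [0.8, 0.9, 0.5]
naive_mean = sum(per_batch) / len(per_batch)          # the small last batch is over-weighted
true_acc = sum(correct) / sum(sizes)                  # 18 / 22, the real epoch accuracy
```

The two numbers differ whenever the last batch is smaller, so the epoch-level bookkeeping matters.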
It doesn't hurt to have it, but we can simplify things by just getting rid of all this, since gradients are already disabled in validation automatically. So we can just simplify this. OK. And then, yeah, we can leave it here. Now, you'll also notice that this code is largely the same, so yes, you could write an intermediate function and just call the same one from both; but for simplicity, we're going to keep it as it is. OK. OK, so let's now train for 20 epochs. You'll notice that it runs a few validation batches first, to make sure you have no bugs, and then we start training. OK. OK, so we're done training. Let's see how 20 epochs did. Let's reload this guy. All right, there's our 20-epoch model. Let's see what our validation accuracy looks like. Not bad. Let's look at the epoch where that change happened: around step 14,000 we had this big increase in validation accuracy. I'm going to guess that's where the model was unfrozen. So let's look at that: step 14,000 is around here, so yeah, it's about the 10th epoch, right? It's index 9. So that makes sense. You can see that unfreezing the backbone at some point will enable you to reach the next plateau of performance. So let's do that. This was for the supervised pre-training part, right? Yeah, and then let's also look at the loss. This is by epoch, so you can see as well what happened there. You know, one thing you could try is also just unfreezing it from the get-go; you might be able to do better that way as well. But let's do the exact same thing with the self-supervised version. OK, so we need to add the validation method to this one as well, right? Yeah, so let's just do that: again, like we did before, we're just going to copy the training step. We already had the training and the validation, right? Yeah. And then I'm going to rename this validation_step.
Oh, OK, you can copy from the previous one. Yeah, well, it's slightly different, because we have these two feature maps. Oh, OK, you're right, I forgot. Yeah, so val loss and then val acc. And again, this no_grad thing doesn't matter, but I'll remove it just to make it cleaner. Great. OK, so now we should be able to train. And I want to make sure we use the exact same training regime, so I'm just going to copy-paste this guy. OK. So this is going to train for 20 epochs, and now we will use the validation set. So this is SwAV, self-supervised. OK, so we're done with the training. All right, let's see how it did. So this is experiment number one; let's look it up on the TensorBoard. It did not update automatically, so let's refresh. OK, version one, right. Let's look at the validation accuracy. Oh, wow, much better. So in blue we have the self-supervised model, and in orange we have the supervised model. They might converge at some point, but what's interesting is that the self-supervised one reaches a higher accuracy much faster. Yeah, it's impressive. So it's great. I think this is something promising about self-supervised learning: it should help you speed up your convergence, so it'll save you money, basically. So let's now train without any of these pre-trained models, just to show what happens. I can do that by just turning this guy off, right? So I'm not going to load weights. So by default, is it false? Correct. OK, can we check? Yeah, but we can just set it to false explicitly. OK, that's better, thank you. So let's also go ahead and remove this unfreezing part and just train with everything unfrozen from the very beginning. Yeah, so I'll just delete this. All right. OK, let's see how it goes. So here, let's remember that we are training the full backbone, and therefore we are doing the forward and backward passes through the backbone, so it's taking way too long. So we are going to stop right now at epoch 10.
It should be just fine. And let's compare the validation curves now; refresh the TensorBoard. All right, do you have a guess for the best accuracy for that model? No, I don't know. I mean, I think I do, but I want to see; I don't want to spoil it. OK, let's see. Oh, wait, sorry. There you go, it's going up. I mean, it's just training from scratch, so it's very slow to converge. It probably will get there at some point, but the amount of compute required for the pre-trained one is a lot cheaper, because you already had this strong prior. So yeah, it's good; I mean, it's not bad. So actually, with the backbone unfrozen, if we do just one step, we get immediately to 80%. So even if we don't train the classifier alone at the beginning, and we just train everything directly from the first step, we immediately start from a very good initial point, right? Yeah, so if we just leave everything unfrozen, we start from there. And so this was trained with roughly 50,000 training samples. Let's see what happens if we really push hard and use very few labeled samples. So I want to show you this really cool function in the Torch library called random_split. If you have a data set, for example here I'm pretending I have a data set of just 10 numbers, and I want to split it into two sets, I want it to shuffle first and then make the split. In this case, I want the first set to have three elements in it and the second set to have seven. And what's cool is that I can make this deterministic by adding the seed argument to it. So let me run it once just so you can see. And you've seen the first set: we got 2, 6, 1. And when I run it again, it's still the same thing, right? So why is this cool? Because that's what we're using under the hood in the data modules, right?
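The random_split demonstration above, shuffle once, cut into two subsets, and pin the shuffle with a seed, looks roughly like this in recent PyTorch versions (using a `torch.Generator` to carry the deterministic seed):

```python
import torch
from torch.utils.data import random_split

data = list(range(10))                       # pretend data set of 10 numbers

gen = torch.Generator().manual_seed(42)      # fixed seed -> deterministic shuffle
train, val = random_split(data, [3, 7], generator=gen)

gen2 = torch.Generator().manual_seed(42)     # same seed again...
train2, _ = random_split(data, [3, 7], generator=gen2)
# ...gives back exactly the same first split
```

Every element lands in exactly one of the two subsets, and re-running with the same seed reproduces the split, which is what the DataModules rely on under the hood.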
So here we have the default seed for the data module, so that when you're building this data loader, we're going to take the CIFAR-10 training split and then split it into two, right? One is the train set and one is the validation set, which we're not using here. We're going to use this argument, val_split, which says how many validation elements I want, right? So in that case, when I run this, I'm going to have 50,000 minus 100, or minus 1,000, whatever we want; 50,000 is the number of elements in the training split of CIFAR-10. So I'm saying, OK, if I want to have 100 training elements, then I need to set num_train to 100, and then I'll use 50,000 minus 100 validation samples. So OK, we have that. So again, we already have the models defined, right? We have three models that we're going to test right now: the model that was pre-trained using self-supervised learning, the model that was pre-trained using supervised learning, and a model that is not pre-trained at all. So let's go there. And we don't freeze the backbone in this case, right? Yeah, we're going to leave the backbones unfrozen and just train. Just to summarize, this is the training step: features, and then fine-tune. So no freezing. So let's look at three different numbers of examples, right? Let's pick 100 train samples, 316 train samples, and 1,000 train samples. And we're going to validate only on 100 batches of validation so we can speed things up. So we're going to loop over this number of train samples here, and then I'm just printing some stuff here that we want to know. I want to set the max epochs so that we have the same number of steps everywhere: we're going to limit everything to 5,000 steps and derive the max epochs from that, right? Which is this guy.
And then we want to check validation not on every epoch, because that's going to be too slow; we want to check it five times within the total number of training epochs. So this is going to tell us how often to check validation. And then we initialize the data, right? Again, just a data module, and we're going to pull out the training data loader just so we can print its length, so we know how many training samples we're running. And then the model, right? In this case, we use the model that was pre-trained using self-supervised learning. And then, because I want to modify what it's going to show, you asked me earlier, how do I change the version name in the TensorBoard logs to whatever I want? I'm going to actually initialize the TensorBoardLogger and set the name myself here, right? So it's going to be SSL- plus the number of training samples that we're using: SSL-100, SSL-316, and SSL-1000. Everything else stays the same. The only other thing I'm going to add here is that I only want to check 100 validation batches, right? So limit_val_batches equals 100. And again, I don't want to check validation on every training epoch; I want to do it every n epochs, which in this case is derived automatically, just to make sure that we're consistent across splits. OK, so we're going to train this, ready? All right. OK, so we just trained the self-supervised model. Let's go ahead and train the next one; we're going to do exactly the same thing we just did. I just copy-pasted the code from above, but now we're going to use our supervised model, the ResNet-50 with pre-trained weights. Pre-trained on ImageNet, right? Correct, on ImageNet. And then we're going to train this guy as well. And I'm naming that one supervised- plus the number of training samples. So let's go ahead and train that. OK, great. So we just trained the supervised ResNet, and now we're going to train the random model, right?
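Put together, the logger and validation-throttling flags discussed here look roughly like this. This is a config sketch of the Lightning API in use; `num_train` and `max_epochs` stand in for the loop variables derived in the notebook.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

num_train, max_epochs = 100, 50                  # stand-ins for the derived loop values

logger = TensorBoardLogger("lightning_logs", name=f"ssl-{num_train}")  # custom run name
trainer = pl.Trainer(
    max_epochs=max_epochs,
    limit_val_batches=100,                       # only 100 validation batches per check
    check_val_every_n_epoch=max_epochs // 5,     # validate five times over the whole run
    logger=logger,
)
```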
So here there's nothing pre-trained; everything is training from scratch. So it's this guy here, and we're going to call that supervised-not-pretrained. So let's go ahead and train those. All right. OK, so they're all done training. Why don't we just look at the logs? What do you think will happen? Let's see; I don't want to spoil it. OK. All right, cool. So I'd like to see the accuracy on the validation set, if you can show me. All right, yes. Let's pick out the ones that we want to show. These are all the supervised not-pretrained ones, right? Then the supervised ones, and these are the self-supervised ones. This one was just a trial. OK, so these are the ones we want to compare. So at the bottom, these guys here are the random ones, nothing pre-trained, and you can see that on average we get in the teens of accuracy. Yeah. And then the middle guys, these are the supervised pre-trained ones, the ResNets, and we get into the 20s and 30s, which is great. So you can see that the transfer learning works, right? Automatically twice the performance, right? Exactly. And what's interesting is that with the models that were pre-trained without labels, these guys, we get double the performance of the models that were pre-trained with the labels. Wow, impressive. So, yeah, I don't know; I guess we can't generalize too much from ImageNet to CIFAR-10, but to me this is super promising as well. All right, so let's give the punchline again. What is the point of today's lesson? What did we learn? Where did we start? So we started by introducing the concept of transfer learning. We covered the supervised version and the unsupervised version, and at least in this tiny experiment we concluded that the performance using the unsupervised version is much better, though we don't know whether this is a result that generalizes.
But the point is that the unsupervised version allows us to train a backbone on our own data, right? Whereas with the supervised version, you need to actually have annotated data, which is expensive, right? So this is actually a very big point: if you need to do something practical, say you have a company or whatever, and you have plenty of data, you don't necessarily have the labels, because, as we said plenty of times, those are expensive. You can still pre-train your backbone with the unsupervised algorithms that are already available to you if you use Bolts, right? Everything is already coded and checked with the authors, I think, and compared in terms of performance, not only in accuracy but also in computation. And Lightning has so many checks in place to maintain a very high speed; it doesn't slow you down. And so, again, we have all those unsupervised, self-supervised learning algorithms that came out just recently, this year, from June or July 2020, and they are already available here in this library. You can use them for training your model on your own data, and then we simply swap in a classifier: we just put a classifier on top of the network and then fine-tune this classifier, or even the other weights, using the labels we have, right? And so this was basically the summary of today's lesson. Anything that I missed? No, I think that was perfect. All right, thank you so much, William, for being with us today. It was really great; I really learned so many things from you. Thank you for having me, this was great. I hope this is useful for everyone. I think it is. All right, have a good evening or day or afternoon or morning, whatever time of day you're watching this video. All right, bye-bye.