Okay, hello everyone. Today's lecture is going to be about optimization for deep learning. I'm Frank Schneider, and I'll be giving this lecture today. So if we perhaps lost you a bit before the winter break, or last year, don't worry: we've got a new topic. We've got deep learning, as you can see from the title. Today's lecture by me is going to be a bit of an overview on training neural networks: why it's hard and what we should do about it. Then next week Lucas is giving a talk about a specific set of methods, and after that we'll have two more lectures on deep learning or deep-learning-adjacent topics. The third one in the deep learning section is given by Philip, with material from Agostinos, and the fourth one will be by Julia. So once again, it's a fresh topic. If we lost you a bit before the winter break, if you still have to catch up on some lectures, don't worry: I don't expect any knowledge from prior lectures, and we'll have a fresh start here. I will try to make some connections to the earlier lectures, because some of the patterns will be the same, but I don't expect any knowledge from you. So don't worry if you're still missing some details from previous lectures; we'll start again. This entire lecture, today's entire 90 minutes, will all be about one very simple question, namely: how do we train neural networks? So, hands up, who has trained a neural network before? That's pretty much all of you, at least definitely the majority, so that's great. You already know something about it, and hopefully by the end of the lecture you'll know a little bit more. This deceptively simple question actually has a lot of different ways to answer it. Since you've all trained a neural network before, you know that there's hardware to consider, right? We could answer this question by thinking about the hardware: do we use a GPU? Which GPU do we use?
How do we best use it so that we utilize it the most, and so on? We won't talk about hardware today. We could also talk about software, right? The question of whether I should use PyTorch, JAX, or go old school with TensorFlow, and what the most efficient implementation is. Again, we won't talk about this today. The third way we could answer this question is about the methods and the algorithms that we use to actually train, and this is the topic of the lecture. So the entire 90 minutes will just be about providing an algorithmic answer to the question: how do we train neural networks? Since all of you have trained a neural network before, and maybe you've heard some lectures about deep learning, I think a totally reasonable answer to this question would be: well, that's rather easy. I just use Adam; it's a good default. And I just use a learning rate of 1e-3; that's the default in the Adam paper, that's the default in PyTorch, and that always works, right? Well, I think it's not that easy. If we look at a real-world example of people training a neural network, we can see that maybe the story is a little bit more complicated. I think it was last year: an entire group at Meta AI tried to train a large language model, something that is a bit similar to GPT-3, at least in the architecture. You probably know GPT-3 because it's behind the big hype around ChatGPT. So Meta AI tried to train a similar large language model, similar in the sense that it uses a similar architecture, and they call it OPT. In their paper, we can look at the section about training methodology, which basically describes the recipe that was needed to train this large language model. That section starts with the sentence "We use an AdamW optimizer." So already it's not Adam.
They use AdamW, and they set β₁ and β₂ to 0.9 and 0.95. I looked it up: the defaults for AdamW, in PyTorch and in the paper, are 0.9 and 0.999, so for β₂ they use something non-default. They also tell you about the weight decay that they use, and then they have a sentence saying that they follow a linear learning rate schedule, warming up from zero to the maximum learning rate over some number of steps. And then the crucial sentence: "a number of mid-flight changes to the learning rate were also required." They actually provide a plot of the result of these mid-flight changes to the learning rate, and just looking at the schedule, it looks really complex, right? But this is the learning rate schedule that was needed to get this large language model to train. So it wasn't that Meta AI just didn't know that there's a default learning rate for Adam that just works, or that they made this plot overly complicated to sound, I don't know, smarter than they already are. This was actually what was needed to get these large language models to train. One more example to show you how complex it is to train a neural network: luckily for us, they detailed all their struggles to train this large language model in the form of a logbook, and this logbook has more than 100 pages just dedicated to what they tried, what they observed, and what they did as a result. They have an entire section in this logbook just about "the loss exploded and we don't know why, so what should we do?", and the first suggestion they give in this logbook is: step one, don't panic. So clearly, training a neural network is much harder than just using Adam with some default learning rate and letting it train, right? We need some empirical learning rate schedule, we need some new methods, and we need a more-than-100-page logbook to describe what we're doing.
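The warmup they describe is easy to write down. Here is a minimal pure-Python sketch of a "warm up linearly, then decay linearly" schedule; the function name and all constants are my own illustrative choices, not the ones from the OPT paper, and the point above is precisely that OPT additionally needed manual mid-flight changes on top of such a clean shape:

```python
def warmup_lr(step, max_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to max_lr, then linear decay back toward 0.
    Roughly the shape OPT's schedule starts from, before the manual
    mid-flight changes the authors describe."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps   # ramp up from zero
    # linear decay over the remaining steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * (1.0 - frac)

# Illustrative values: peak 3e-4, 2000 warmup steps, 100k total steps.
print(warmup_lr(0, 3e-4, 2000, 100_000))     # 0.0
print(warmup_lr(2000, 3e-4, 2000, 100_000))  # 0.0003 (the peak)
```

In practice such a schedule would be wrapped into whatever framework drives training, e.g. a PyTorch `LambdaLR`, but the shape itself is just this piecewise-linear function.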
So, last month I organized a workshop at NeurIPS, basically getting all the experts in one room to talk about: well, how do we train neural networks? We had the first author of the paper that I just mentioned, the OPT paper; we had the author of the Adam paper; and other experts from Google and from Meta, all together in a room to discuss how we train neural networks. Unfortunately, I think a very fair summary of the outcome is that currently no one really knows how to train neural networks, at least not efficiently. That doesn't mean we don't know anything: we have some intuitions, we have some guidelines, but it's all super vague, and we can see that just from the OPT example, where experts at Meta took so long to get such a large language model to train, and needed to fiddle around with the algorithms and the methods for a long time to actually get something to run. So this is already, I think, the key point of this lecture: currently, training a neural network is quite hard, and we need to do something about that. In my lecture, I first want to take a bit of time to look at the background on why training a neural network is such a challenging task. We saw from this example that it is hard and does require some fiddling around, but I want to look into the methods a bit more to understand why this is needed. Maybe from the previous lectures you noticed the pattern that they usually go something like this: we first introduce the current state-of-the-art method, the classical approach, and then we describe a bit what is wrong with it and how we can improve it. For deep learning and training neural networks,
it's a bit different, because I will actually take the entire second part of my lecture just to find out what the state of the art in training neural networks even is. So it isn't "here is the one classical method that we need to improve"; as we will see, there are lots of them, and we don't really know what the state of the art is at the moment. And then lastly, hopefully I can also tell you a bit about how we can improve the situation: given that training a neural network is a bit of a mess, what can we do about it? So that's the structure of my lecture, and I first want to start with understanding better why neural network training is currently such a challenging task. This is just a very simple overview of what a neural network is. Since all of you have trained one, I assume you know this, but it gives us a shared notation and the same terms for this lecture. I will only focus on supervised learning, not because my conclusions don't extend to the other settings, but just because it's a lot simpler to explain. So in my setting we just have supervised learning with a neural network, which means we have some neural network in the middle here that is parameterized by some parameters θ. We get some input and we want to produce some output, and the entire lecture will only be about understanding how we can set this θ most efficiently to get the output that we want. That's all we want to do. If we look at an even simpler problem of how we can set the parameters of a model such that it produces a hypothesis we are happy with, we can look at this simple example. That's the data we have: some data points x and some outputs y, just a handful of data, and we'll keep it really simple and just consider linear regression, right?
So our hypotheses are currently just straight lines. We can start with a random hypothesis: let's say the relationship between x and y looks something like this straight line. It obviously doesn't, but it's a first start. Since it's a straight line, it is parameterized by two parameters; I'll call them θ₀ and θ₁. In the right picture, we'll look at what I call the loss landscape view, or the parameter view, where our hypothesis sits somewhere here: a specific value of θ₁ and a specific value of θ₀, which together form this hypothesis shown in green. Now we could try a second one, just to see if we do better: the red one, different parameters, different model. One very straightforward approach to finding a good hypothesis would be just to try out lots of them until we find this one here that, at least given this data, provides the best hypothesis. If we actually test all of the possible hypotheses, we get this loss landscape here, right? What is plotted here is, for every value of the thetas, how good or bad the hypothesis is, with good hypotheses shown in white and bad ones in blue. So effectively, what I want to do if I do machine learning, if I want to do some curve fitting, pattern recognition, whatever, is basically find the minimum of some loss landscape, of some loss function, right? So basically that's what machine learning is: just finding some minimum in a huge loss landscape. Obviously I'm just looking at linear regression here, which makes everything super simple. For a neural network, it's more complicated.
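For the linear-regression picture, this "try out lots of hypotheses" idea really can be done by brute force. A sketch with a made-up handful of data points, scanning a grid of (θ₀, θ₁) pairs and keeping the one with the lowest mean squared error:

```python
# Evaluate the line y = theta0 + theta1 * x on a few (x, y) pairs and
# keep the best grid point. The data values are made up for illustration.
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

def mse(theta0, theta1):
    """Mean squared error of the straight-line hypothesis on the data."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in data) / len(data)

# Scan theta0, theta1 in [-5, 5] with step 0.1 -- the "loss landscape"
# as a grid -- and pick the point with the smallest loss.
best = min(
    ((mse(t0 / 10, t1 / 10), t0 / 10, t1 / 10)
     for t0 in range(-50, 51) for t1 in range(-50, 51)),
    key=lambda t: t[0],
)
print(best)  # (lowest grid MSE, best theta0, best theta1)
```

This is exactly the white region of the loss landscape plot; of course, exhaustive scanning only works because there are two parameters here rather than ten million.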
The loss landscape is not two-dimensional; it's maybe ten-dimensional, ten-million-dimensional, or more, and the hypotheses we can build are also more expressive, right? But at the end it's all about finding the minimum in a loss landscape. Well, actually, that's wrong. That's not what we're doing. So let's have a closer look at what we're actually doing. While we could solve the optimization problem like this, starting at a random hypothesis and taking small steps in the downhill direction until at some point we arrive at the minimum, that's not really what we want to do. So let's have a closer look at the exact setting we're actually in, to see why this is not true and why machine learning is not optimization. Okay, I'll introduce some math notation so that we can talk about it in a bit more detail. Again, the setting is that we want to learn a function, and that function is parameterized by some thetas, so it's the same setting as before. We get some input x and we predict some output y, and there is some relationship between the inputs and the outputs, the x and the y. The goal is that our model makes predictions, denoted here by the y with a hat, and hopefully these predictions fit the true targets. We determine whether we have a good fit with some loss function L; for now, just assume that L is given, and that it quantifies how different our predictions are from the true targets. Now we can think of this entire example in the setting of image classification: our inputs x would be images, our outputs y would be the labels. For example, is this an image of a cat, of a dog, of a hot dog?
Whatever. And what we want to minimize, the goal, would be the expected loss under the true data distribution, right? For all images of dogs and cats and so on. That would be what we want from machine learning: an image classification model that predicts, as well as possible, the true label for all images out there. But obviously we don't have access to this true underlying data distribution, because it would be infinite. What we instead have is access to a finite training data set: ImageNet, CIFAR-10, whatever. On this we can test our model, and we can train our model. This finite training set, which I call D_train here, consists of tuples of inputs and outputs, but it is just a finite sample from the true data distribution, the true collection of all possible images of cats and dogs. We only have access to some sub-sample, a part of it. So what our entire optimization method operates on is just this empirical loss. It's not the true loss, it's not the true risk, the expected loss; it's the empirical risk, or the empirical loss.
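To make this concrete: the loss we can actually compute is an average over whatever finite sample we evaluate on, and the smaller the sample, the noisier the estimate. A toy sketch, with all numbers made up:

```python
import random

# Toy illustration: the "loss" we can compute is an average over a
# finite sample; a small batch gives a noisy estimate of the full
# empirical loss. All numbers here are made up.
random.seed(0)
per_example_loss = [random.uniform(0.0, 2.0) for _ in range(10_000)]

# Empirical loss: average over the whole finite training set.
full_loss = sum(per_example_loss) / len(per_example_loss)

# Batch loss: average over a random sample of, say, 200 examples.
batch = random.sample(per_example_loss, 200)
batch_loss = sum(batch) / len(batch)

print(full_loss, batch_loss)  # close, but (almost surely) not identical
```

And the true expected loss sits one level further out: it would be the average over the infinite population that even `per_example_loss` is only a sample from.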
So we only evaluate it on the data set that we have. And even this, most of the time, is not really true, because in deep learning, as you probably know, we use batching. The loss that we evaluate and keep training on, the gradient signal that steers our training process, isn't actually computed on the full training set, but only on a small sample of it. We repeatedly draw batches, so we don't evaluate on the full ImageNet set, but only on, say, 200 images at a time. So in that sense, what we operate on, what we minimize, namely the empirical risk or the empirical loss, is not the quantity that we actually care about. In that sense, machine learning is not optimization, and so from now on I will hopefully stop talking about "optimization methods" for deep learning, or "optimizers," because that's not really what we want them to do. We don't want them to optimize; we want them to train. So from now on, hopefully without slipping up, I will call them training algorithms, or training methods. SGD, Adam: I will now call them training algorithms. Given that we optimize, or operate, on one quantity, namely this empirical loss, but are actually interested in the performance on another quantity, namely the true loss, we get the specific properties of machine learning, for example overfitting, right? Here you can see the training curves of three different training algorithms; in your head, if it confuses you, just translate "training algorithm" to "optimizer," that's fine. Three different methods; if I remember correctly, it's SGD, a momentum version, and Adam. You can see how the training loss behaves over the number of epochs, so epochs are on the x-axis, and I will show that in a second. This is the quantity that we operate on and that our optimizer, our training algorithm, actually gets to see. But what we are interested in is maybe something more like the test loss: how good or bad am I on unseen data? Again,
the test loss is only a surrogate for the true data distribution that we don't have access to, but we can use it as a measure to understand how good or bad we are on unseen data. Now, if you see this, probably your first reaction is: oh, we're overfitting, we should have stopped training earlier, right? Because the training loss is still going down, but the test loss increases, so on new, unseen data we actually become worse. Well, maybe the loss isn't even the quantity that we're interested in. For images, I don't really care about the cross-entropy loss of a model, but I might care more about the accuracy, meaning how many images do I get right or wrong. And here we actually see a different picture. Naturally, the training accuracy increases over time, but the same thing happens for the test accuracy as well, which is a bit weird if you look at the test loss, which at some point just increases and makes you go: oh, careful, I'm overfitting. So hopefully with this example I've shown you that overfitting is actually more complicated than we might initially think, and that something like generalization, a word that gets thrown around a lot, is also a bit more complicated, because we generalize in multiple ways. We first generalize from the training set to the test set, but oftentimes we also want to generalize from something like the loss to some other performance quantity. In translation, you usually train on something like a cross-entropy loss again, but then you care about the quality of your translation, and that is sometimes measured with something like a BLEU score; you don't optimize the BLEU score itself. For images, it's accuracy versus loss. For GANs it's the same: you train on a loss, but what you care about is how good or bad the images look visually to a human. So overfitting and generalization are much more complicated than just going from train to test. But this
also again emphasizes that what we're doing in machine learning, when we train a neural network, is not optimization, because the quantity we operate on, the quantity we can actually evaluate and try to minimize, is not the quantity we care about at the end. So maybe this is the second big takeaway from this lecture: optimization and training are not the same thing. As a result of this whole setting, of having mini-batches to train with instead of the full data set, and in general wanting something that not only optimizes but trains, people have come up with lots of different methods, right? Stochastic gradient descent is probably the prime example, with its very simple update equation: start with some θ and take a small step, scaled by this learning rate η here, in the direction of the negative gradient, a gradient evaluated on the mini-batch. But you probably also know the momentum variants, for example heavy ball or Nesterov's accelerated gradient, which try to take previous gradients into account and learn from them how to behave in the future. And there are all these variations like RMSProp and Adam, and so on and so on. At the end, there are actually more than 150 methods. I know because I counted: I built this table, and I'm pretty sure that half of the existing methods aren't even on it, and it's already so full that you can't really read it. If you look at the slides again on your computer, maybe in detail, you might complain that I double-count some entries, because I think SAdam, or BAdam, is on there twice. But that's actually not a mistake.
It's just that there are multiple methods with the same name. So before we come up with new training methods, maybe we need to come up with new names first, because apparently we've run out of names. That is the somewhat ridiculous situation of neural network training at the moment: we have more than a hundred methods to choose from. And the critical part is that even if we pick one of these methods, for example SGD, we still need to learn how to use it. For SGD, we still need to decide what learning rate to use, and as you hopefully saw from the OPT example, that's not a trivial task. It gets more complicated if we pick Adam, because then we have even more hyperparameters to pick. For Adam, we not only have the learning rate; we also have the β₁ and β₂ that popped up in the beginning, and we have something called ε here, which was initially introduced as a safeguard against dividing by zero, but then people thought maybe this is also a hyperparameter and you should tune it too. So now, with Adam, we have four hyperparameters to tune, and I cannot just try them all out, because as you know, the cost of tuning explodes as the number of hyperparameters increases. So that's the ridiculous situation we're in right now: we have so many methods, and we don't really know how to use the methods that we have. My suggestion, and what will be a bit of the core of this lecture, is that we need proper benchmarks for these methods, to understand which ones actually help, which ones improve things, which ones are necessary, and which ones are maybe just a bit of noise. Hopefully, with some rigorous benchmarks, we can cut this list down to something more manageable, because a list of 150 methods is not something you can go through if you want to train a neural network. And then, on the side of the hyperparameters, maybe we can do something
different, and actually provide some new tools for users to understand what the effect of changing a hyperparameter is on the training process. So that's the approach I want to present today: first, have some more rigorous benchmarks to really understand which of these methods are necessary, and then use debugging tools to give users more insight into how they should use these methods and perhaps set the hyperparameters. This next section will be all about benchmarking training algorithms. As I said before, this is the part of the lecture where we would normally introduce the classic method, the current state of the art, but finding this method is actually an entire quest of its own, and I will tell you a little bit about it. So here's a question for you. Let's say we have two training algorithms; in your head you can think SGD and Adam. With two training algorithms, call them A and B, I want to understand which one is better. What do I do? What's your suggestion, what should I try first? It's maybe a bit too vague a question to actually answer, but let's say I run algorithm A on my problem, on my neural network training, and I run algorithm B on it as well. How should I determine which one is better? Which number should I look at, which number should I compare? Any suggestions? One score? All right. Any other suggestions? Yeah, great comment, exactly. I'm just going to repeat it so it's on the video as well. The question was whether we should consider the resources we spend on training as well. Maybe algorithm A and algorithm B reach the same solution, the same quality of model at the end, but one is more efficient: maybe one just needs fewer GPUs or less time, right? And if we had this conversation for longer, I think we could come up with more ways to compare them. Maybe one is just faster.
Maybe one requires less tuning. Maybe one is just easier to handle, and so on. So that's already one of the fundamental problems of comparing these methods for deep learning and understanding which optimizer, which training algorithm, we should use: we haven't really decided what the quantity is that we compare them on. We don't actually know, at the moment, what "better" means. If we ask "is Adam better than SGD?", well, we first need to define what better means, and for deep learning this could mean multiple things, right? We could have something that is just faster: instead of training a model taking 10 hours, with a new method it takes eight hours and I still get the same performance. It could also mean getting a better performance in the same time. It could mean being more robust to the choices that we make, like the learning rate, or maybe model choices. Here's another example. Again we compare A and B, and I find that A gets me 90% accuracy and B gets me 91% accuracy. Does that mean that B is better, or is it just noise? How do we know? I could just retrain, and maybe this time A gets 91% and B gets 90%. Since deep learning has this stochasticity baked in, it's really hard for us to determine whether our results, our better performance, are just noise, such that if we tried again we'd actually get a different result, or whether it's a pattern, a significant result, a significant difference between the methods. Now, this is not necessarily a fundamental issue, because one thing we could do is just run it multiple times and take something like the median or the mean score, maybe use some standard deviation, and decide whether the difference is significant. But it means that evaluating methods becomes so much more expensive, because I don't train once; I have to train maybe ten times, and do this for every method. So whatever I decided to do before to compare methods, I now have to do at ten or twenty times the cost. And that's unfortunately not the last of our problems. At least for someone like me, who wants to understand what the best method is in general, what the best training algorithm is in general, I actually have to do this on multiple problems, because if I just train one model and the method is better there, that doesn't tell me whether it's better in general. For general-purpose methods, and that's currently the way we use Adam, we throw it at reinforcement learning problems, we throw it at GANs, we throw it at large language models, I actually need to test all of these cases to understand whether a method is better in general or just on this specific problem. And since deep learning is this complicated area where we work with data, we work with models, we work with hyperparameters and tuning, it's really hard to isolate the effect that just the method has on the rest of the training process, right?
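Because of that run-to-run noise, the repeat-and-aggregate idea looks something like this in practice; the accuracies below are simulated with made-up noise levels, not real benchmark results:

```python
import random
from statistics import median

random.seed(42)

# Simulated final accuracies for two algorithms over 10 repeated runs;
# the per-run noise is deliberately on the same order as the gap
# between the two methods (all numbers are made up).
runs_a = [0.90 + random.gauss(0, 0.01) for _ in range(10)]
runs_b = [0.91 + random.gauss(0, 0.01) for _ in range(10)]

# Aggregating over repeats gives a much more stable comparison than
# any single pair of runs would.
print(median(runs_a), median(runs_b))
print(sum(a > b for a, b in zip(runs_a, runs_b)), "of 10 pairings put A ahead")
```

The point is exactly the one above: a single 90%-versus-91% comparison can flip on a retrain, so any honest comparison multiplies the compute cost by the number of repeats.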
So it's really hard to isolate the effect I get from using a better training algorithm from all the other aspects that go with it. Ideally, I would also need to test every method with different hyperparameters, because otherwise maybe it's just this specific hyperparameter, this specific learning rate that I chose, that works better with this method, and that doesn't mean the method in general is better. Again, it's not a fundamental problem; it just makes everything a lot more expensive. And, something I basically already said: usually there's some interaction between these parts, right? I think it's clear that the learning rate interacts with the training algorithm, and the optimal learning rate for SGD is maybe not the same as the optimal learning rate for Adam, and so on. But this also extends to the models we're currently using. I think it might be valid to say that perhaps models like ResNets or the Transformers are popular because they are easy to train with the training methods we currently have, and if we came up with new training algorithms, perhaps we could actually train different models. So there is some connection between the models we use and the methods we use to train them; again, it just makes everything more expensive. And this is just a last point I wanted to make: it's really hard to actually compare SGD with Adam, or SGD with momentum, because these are not instances, not runnable algorithms; they're families of algorithms. So for example, if I want to compare SGD with momentum.
Well, it depends on how I set the momentum parameter of this method, here called ρ. I could, for example, just set the momentum parameter to 0.0, which means I get SGD back. So in a way, momentum is just a more general method than SGD, and comparing those two doesn't make a lot of sense. What I can do is compare specific instances of these algorithms: specifically saying I compare SGD with this learning rate, or with the learning rate tuned over this distribution, to some other algorithm that is completely specified. So you see, there are a lot of challenges in comparing training algorithms for deep learning, and most of them come from the fact that everything gets so expensive. We can't simply run a benchmark and find out within five minutes whether a method is better; it actually takes a lot of compute to go through a proper benchmark. So roughly two years ago, we tried to identify the state of the art of training methods, what is currently the best method to train with. That's why we set up this benchmarking process and tried to find some solutions for the issues I just mentioned. Starting with the optimization methods that we tested as training algorithms: obviously, it's not possible to test the entire list I showed you in the beginning, all 150 methods; as I said, that's just way too computationally expensive. So instead we had to pick 15. That's still doable, but it means I can't really make any claims about algorithms that aren't on this slide. Hopefully we picked the most important ones: we have some classics like SGD and the momentum methods, we have RMSProp, and then we have newer methods, Adam variations like RAdam or NAdam, and something like AdaBelief that came out two years ago, I think.
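The claim that momentum with ρ = 0 recovers plain SGD is easy to check numerically. A minimal sketch of both update rules on the toy loss L(θ) = θ²; the gradient 2θ, the step size, and the number of steps are arbitrary illustrative choices:

```python
def sgd_steps(theta, eta, n):
    """Plain SGD: theta <- theta - eta * grad."""
    for _ in range(n):
        g = 2.0 * theta            # gradient of the toy loss theta^2
        theta -= eta * g
    return theta

def momentum_steps(theta, rho, eta, n):
    """Heavy ball: v <- rho * v + grad; theta <- theta - eta * v."""
    v = 0.0
    for _ in range(n):
        g = 2.0 * theta
        v = rho * v + g            # with rho = 0.0, v is just the gradient
        theta -= eta * v
    return theta

# With the momentum parameter set to 0.0 the two trajectories coincide:
print(momentum_steps(5.0, 0.0, 0.1, 20) == sgd_steps(5.0, 0.1, 20))  # True
```

So "SGD with momentum" really is a family that contains plain SGD as its ρ = 0.0 instance, which is why only fully specified instances can be compared fairly.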
I think But once again, we are only testing specific instances of these Algorithms, so we're not testing SGD, but we're testing SGD with a learning rate tuned in a specific distribution. Yeah Yeah Yeah, it's actually pretty similar and I think that could be a so I need to repeat the question first And in the beginning on the OPT paper. I mentioned that they use the atom w Why isn't that in w on here? Is it similar to this? Yes, it's very similar and Maybe there can be a philosophical argument whether atom w is a different training method or not Because basically the difference between atom and atom w is how you implement weight decay So it's more of a difference in terms of how you use the regularization strategy And again, it sort of gets into the details of whether this is a new training method Whether this is part of the sort of optimization part or is the regularization maybe more power of the model to be considered there but this perfectly gets into the issue of it's really hard to train is really hard to separate these parts and In this case, we didn't decide to include atom w, but it's certainly something that you could do. 
Yeah Okay, so my point is we couldn't test all of them So there might still be wonderful methods out out there that we missed which is very unfortunate But at least we could sort of focus ourselves on one or a selection of the most popular methods here Now as I said before we can't just run it on one problem and then state from there that it's in general better So we had to compute or had to create a set of some problems in this case We used eight different problems and you can see that they vary quite a lot We have some very very simple sort of noisy quadratic They're more like test problems toy problems to understand the algorithm a bit better We have some VA ease some larger image classification models Resnets and also aren't and so recurrent neural networks and you can see that the runtime Already increases if you go down and so maybe one of the larger models takes already a handful of hours to run just a single time But obviously we have to do this for all 15 training methods So if you see a training time here already you have to multiply it by 15 to just get it for all the optimization methods but a Big thing in deep learning and neural network training is that we have to do tuning right? 
We cannot just pick one learning rate, as we saw before, declare it the good one, and be happy with it — we actually have to consider tuning as well. For tuning, we decided to simulate four different scenarios. Maybe one algorithm is really good when run out of the box without any tuning, and so it would be a great candidate if you just want to see whether the model or the data you set up basically works — whether there's any use in even trying to tune it. We tested this in what we call the one-shot setting, where we do no tuning at all and just use the default hyperparameters. Think of a practitioner who has a single GPU, only has it available for a couple of hours, and wants to know which training algorithm to use to quickly get some results, maybe for debugging or testing purposes. Then, as you go down this list, we increase the budget and use more and more runs to tune the hyperparameters of each algorithm: first 25, then 50, then 75. The large budget would be more like someone in industry who really wants the best performance they can get, so they can deploy the model in practice; they have lots of GPU resources available and can spawn many different hyperparameter settings in parallel. This means that to run the large-budget tuning, we have to do 75 individual runs, each of which may take five hours, and we have to do that for 15 different optimization methods. And there's one more thing we wanted to vary, which is the learning rate schedule. This is probably something you know: over the course of the training process, you should maybe decay the learning rate over time, or at least vary it to some degree, to really get the best performance. So we considered not just one such schedule, but four
different ones, because again there might be an interaction between the learning rate schedule and the best training algorithm. We tested a constant schedule, a cosine decay, a cosine decay with warm restarts, and a trapezoidal schedule. The constant schedule, as the name suggests, keeps the same learning rate over the entire training process. The cosine decay slowly decays the learning rate over time — that's the orange one. In green, we have something that repeatedly restarts during training: you decay the learning rate, but then increase it again. And the trapezoidal schedule is a combination with something called learning rate warm-up, where in the beginning you slowly increase the learning rate — basically until your training algorithm has moved into an area that is a bit more well behaved — then keep it constant for a while, and decay it at the end. We tested essentially every possible combination of these; for example, we tested Adam on the second test problem with a medium tuning budget and a cosine schedule. All of these combinations in total give us almost 2,000 different configurations. But since one of those configurations might, for example, include medium tuning, we don't end up with 2,000 runs, but with a bit more than 50,000 individual runs that we have to do. So this already gives us a huge data set of training runs to hopefully extract some knowledge from. As I said before, we also repeated the best hyperparameter setting multiple times, to understand whether the performance is just due to noise, and how much the results vary if we simply repeat a run. So let's look at some examples of the results. This complicated figure shows the performance of every optimizer, every training algorithm we tested.
Those are the 15 methods here, and each training algorithm corresponds to one color. The figure shows the performance of all these methods on all the test problems — each test problem is one of these axes. For example, the Tolstoy character RNN is one of the problems: training a character-level language model for character prediction, using an RNN on some Tolstoy data. There we can see that the black line is the best, reaching a performance above 62.5 percent; it corresponds to RMSProp. Rather at the bottom we have the yellow and orange ones, which are AMSBound and AdaBound. So on this problem, RMSProp performs better than something like AMSBound or AdaBound. But if you now look over at all the other test problems, there isn't really a clear picture — it's more like a big spaghetti plot, with lots of lines crossing all the time, and the best method actually changes a lot from test problem to test problem. This figure only shows the results for the large tuning budget and the trapezoidal schedule, but the picture looks the same for the other budgets and schedules. So the main message is that there isn't really a clear answer; there isn't a clear state-of-the-art training method for deep learning. Instead, we have lots of methods that are sometimes good and sometimes not so good. And we also see that something rather traditional like Adam — I need to quickly switch here so I can point again.
Yeah. Something more traditional, and maybe the most popular method at the moment — Adam, shown in red here — actually performs quite well. All the methods that came after Adam, which are direct variations of it, don't really show a significant and consistent performance improvement over Adam, at least in our benchmark, in the situations we tested. And this is especially true if you consider that what I show here is the average performance over multiple repetitions. If I also plot error bars, you can see that the differences shrink a bit, because most of the time Adam is actually within one or two standard deviations of the best result. So I think what this entire benchmarking part tells us is that currently there isn't really a clear state-of-the-art method for how to train a neural network. There are lots of different methods, and which one is best depends on your problem — and maybe on how much time you have, on the budget, on how long you train, and so on. In our benchmark Adam looks quite good, but maybe for large language models you need something else; we also saw relatively consistently that for the character RNN, RMSProp is actually better. So we get this really muddled picture: there is no clear "this is the training recipe, this is the protocol to train neural networks". Instead we get this very messy situation where you don't really know what the current state is. And some of you might object: well, you just tested this in a limited benchmark.
You could have tested other methods, or maybe looked at larger problems — and some of that criticism is definitely true. But we're actually trying to do this now at a much larger scale. Hopefully this year we can release the MLCommons benchmark from our algorithms working group. Together with George Dahl from Google, I chair this working group at MLCommons, and together with researchers from Google Research, from Meta, from the University of Toronto, and so on, we are trying to build a competition to measure neural network training speed-ups that are due only to algorithmic changes. So we want to build a benchmark — a competition — to see how much we can speed up neural network training by changing the algorithm, by building more efficient algorithms. I really see it as an extension of the benchmarking work we did before, but now at a much larger scale. Since we have a lot more people involved — researchers from different companies and universities — we can build much larger-scale problems: not only these smaller-scale ones, but larger-scale ImageNet problems, WMT translation models, and so on. A big difference from what I showed you before is that this time it's not us who select the methods and decide how to tune them — people can actually submit algorithms to us. So if you have a great idea, say an improvement to Adam or an entirely new method, you build your submission and send it to our benchmark. It's a competitive benchmark, where people compete against each other with their methods, instead of us deciding how to use them. And I think what this MLCommons effort shows — along with the amount of money, both direct and indirect, that these companies and universities put into it — is that it's not just us who don't really know what the state-of-the-art training method is; the entire community is at the moment really unable to identify
which of these training methods are the best, and what the proper protocol for training neural networks is. So it's not just us looking for this manual, this flow chart of "if I see this, I should do that" — companies with a lot of expertise, like Google or Meta, are looking for the exact same thing. All right. Given this situation — so many methods, and we don't really know how to use them — what's the solution? What can we do to improve things? With this next part, what I want to explain is that one strategy to improve on what we currently have is to once again look into the methods in a bit more detail, recognize that there is additional information available that is currently not used, and then actually use this additional information — hopefully, as an end result, to build better algorithms that can do a lot more automatically, instead of us fiddling around, babysitting, and tuning a neural network. But maybe, until we have those great new methods that do everything automatically, we can provide some of this available information in the form of debugging tools, to give users a bit of help. So I first want to talk about the first part: what additional information is available that we could use to build these debugging tools or these training methods? If you remember the slides from the beginning: we don't really optimize the true loss, the loss we care about, but only operate on some smaller, finite training set. The same is true for the gradient. We don't evaluate the true gradient, meaning the gradient on the entire underlying data distribution; we only compute this empirical gradient, which is just a sum over the n training examples in our training data set. And as I said before, most of the time we also
approximate this by a sum over some individual samples from the training data set — our mini-batch. So although n might be 10 million examples, we usually only look at maybe a hundred, two hundred, or a thousand at a time before we take our step. And what we can see here is that it doesn't really matter whether it's the mini-batch gradient or the empirical gradient: either way, it's just a sample from the true gradient. We pick exactly n samples, or b samples, and average over them — so what we look at is really an estimate of the true gradient. All we use at the moment is the mean of this estimate. But if you remember back to some of the other lectures: well, if we have a mean, maybe it also makes sense to talk about the variance of this estimator, right? But that doesn't really exist in PyTorch; it's not available. What I want to say with this slide is that what Adam uses, what SGD uses, what all the methods use, is just this mean gradient that you get from averaging over a mini-batch. But there is additional information hidden in there, namely all the individual gradients — the mean is just a sum over b individual gradients from the individual training examples. And if I have this information — how these individual gradients spread out, whether they scatter a lot or concentrate around the mean — that is additional information I could use, because it tells me how noisy my gradient is; it tells me how certain I am about the gradient at the moment. And what is hopefully also visible here is that this is not some quantity that is super expensive to compute, because basically we're computing it already, right?
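To make this concrete, here is a minimal numpy sketch — a toy linear model with a squared loss, not any particular framework's API — of the per-example gradients that the usual averaging throws away, and the variance we could read off them:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # mini-batch of B=8 examples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)                      # toy linear model parameters

# Per-example gradients of the squared loss 0.5 * (x_i @ w - y_i)^2.
residuals = X @ w - y                        # shape (8,)
per_example_grads = residuals[:, None] * X   # shape (8, 3): one gradient per sample

mean_grad = per_example_grads.mean(axis=0)   # what optimizers normally see
grad_var = per_example_grads.var(axis=0)     # the extra signal: gradient noise

print("mean gradient:", mean_grad)
print("element-wise variance:", grad_var)
```

The mean is what every optimizer already consumes; the element-wise variance is the "free" extra quantity the lecture is talking about — it tells you how much the b per-sample gradients disagree before they get averaged away.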
We're taking our b samples, and at the end we just average over the gradients — so in principle I could just look at the gradients before we average them. In theory, it's not something that's way too expensive to compute; it's more that the developers of PyTorch or JAX or whatever just decided this is not a quantity that needs to be exposed. And it's a bit of a chicken-and-egg problem: PyTorch doesn't really want to implement it until someone has shown they can do something useful with it, but it's really hard to do something useful with it if it isn't available in PyTorch or other frameworks. So people from our group developed a package called BackPACK — and there are similar packages — with which we can efficiently access this additional information: not only the individual gradients, but also other quantities like variances or second-order information. So hopefully I can convince you that there is more information available in this compute graph, during the backward pass of training a neural network, that we can leverage and that is currently not used. Getting it efficiently is actually a little more complicated than it might appear on the slide, but it is possible, as we've shown with BackPACK — and not only theoretically possible, but practical. So one question is: what do we do with this additional information? Here's just one example of how we can use it to steer our training process a little better; it's a simple one, and I can give a few more later on. One very popular and common thing people do when they train a neural network is to stare at the loss curve and see it going down.
Hopefully over many iterations. Here I plotted the mini-batch training loss over time, over iterations. You stare at the loss curve, and from it you decide what to do with your neural network — whether you should increase the learning rate, decrease it, or whatever. What I show here are actually two different loss curves, one in orange and one in blue; the blue one is really hard to see because it's almost exactly the same and lies just behind the orange one. So I have two loss curves that are virtually the same. Now, if I look at the loss landscape — at what is actually happening during optimization — I can see that blue and orange are doing two completely different things. The blue curve, the one behind, basically jumps back and forth in this rather narrow valley. With the shaded background I denote the loss — the darker, the higher the loss — and you can also see these iso-lines; the loss landscape here basically looks like a narrow valley. Blue starts here, always jumps to the other side of the valley, and makes some progress towards the minimum, which is marked with a star. Orange does something completely different: it starts at the exact same point, but the arrows are almost impossible to see because they are so small. Orange makes very, very tiny steps along one side of the valley and moves down. So although they do completely different things in the loss landscape, their loss curves look pretty much the same. What the loss curve tells me is whether my network is training or not — but it doesn't tell me why, and so it doesn't tell me what I should do. For blue, a reasonable thing to do is take one step with half the step size, which would end up at this point here — actually quite a good point. And for orange, the obvious strategy is to increase the
learning rate and make larger steps, to actually make some progress. So the loss curve tells me whether my network is training or not; if I could look into the loss landscape, I could actually see why it's training, and what I should do to improve it. The problem is, for neural networks I can't really look into the loss landscape. The loss landscape has tens of millions of dimensions, hundreds of millions, and ever more. So it's impossible to look into, and even random cuts through the loss landscape don't really tell me anything. So although the landscape would nicely tell me what I want to do, for practical neural networks this is just way too much information — tens of millions of numbers if I look at the trajectory — and it doesn't answer the question I actually have, which in this case is: should I take a larger or a smaller step next? But the quantity we are interested in here is actually rather simple, because it's just about how large my step is in relation to the loss I'm seeing at the moment. So with this plot I want to introduce a quantity, and it's a little hard to understand — or my explanation is always a bit hard to follow — so I'll try to give you a visual definition, shown in three different scenarios. For now, let's focus on the left plot. What I do is observe the loss at some point θ_t. Then I take a step — for example in the direction of the gradient, or in the direction that Adam suggests; it doesn't really matter — to some other point θ_{t+1}. So this is just one step in the loss landscape, and since it's a single step, the whole update happens along this one-dimensional direction, right?
So it doesn't matter how high-dimensional my neural network is: if I plot one single step of my optimization procedure, it's always one-dimensional. At the new point I observe another loss — in this case a bit lower than the loss I started with — and I also observe whether the loss landscape still goes down or goes up there. This is illustrated by this black line here, which is the slope in the direction I'm currently stepping. So what I'm basically doing is: I start here, I evaluate my loss, and I also check the slope in the direction I'm stepping — it should hopefully go down, otherwise it doesn't make much sense to step in this direction. Then I take one step to a different point, where I evaluate the loss again and determine whether it's still going down in this direction or going up. And depending on what I observe, I can build a model of how the loss landscape behaves along this direction. In this case, I observe a loss that is lower than the one before, and I see that the slope is still going down — which means the loss landscape probably looks like this blue line, and the minimum along this direction is probably still quite far beyond the point I stepped to. We call this scenario understepping, and we denote it by a negative alpha. I could also get the complete opposite observation: I observe some loss, I step to another point whose loss is pretty much the same, and suddenly the slope is going up — so I probably overstepped the minimum. This is what we call overshooting, and we denote it by a positive alpha, larger than zero. And then, obviously, there's the situation in between, where I stepped pretty much right to the minimum along this direction.
So all I want to do with this alpha is characterize whether I'm understepping, whether I'm roughly minimizing, or whether I'm overshooting — and I do that by only considering the loss along this single direction. This alpha is always just a scalar number, no matter how high-dimensional the neural network is. One thing you might notice, looking at this: if you build this quadratic model, you actually have four observations — a loss and a slope at the start, and a loss and a slope at the end — but a quadratic has only three parameters, so an exact fit doesn't really work. Well, all the observations we're making — the losses as well as the slopes — are noisy, because we only observe them on a mini-batch. So we can use the information from the individual per-sample observations to calibrate this model, to understand which observations we should trust more because they are less noisy, and which we should trust less. Okay. The one thing you should remember is that this alpha is a scalar quantity that we can compute efficiently, and that tells us how small or large our steps are compared to the loss landscape in the direction we're stepping: negative values mean understepping, positive values mean overshooting, and a value of one means I'm jumping exactly to the other side of the valley. And we can plot this alpha as a distribution. Why a distribution? Because we look at alpha in each individual iteration, so if we track all the alphas, we get a distribution over the steps. In this case, orange and blue behave consistently over time: blue is pretty much always stepping to the other side of the valley, so it gets an alpha distribution very close to one, and orange gets an alpha distribution very close to minus one.
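As a toy, noise-free stand-in for this quantity (the real Cockpit alpha comes from a noise-aware quadratic fit to the observed losses and per-sample slopes; this sketch only reproduces the sign convention, using just the two directional slopes):

```python
def local_step_alpha(slope_start, slope_end):
    """Toy stand-in for the alpha quantity, valid for a descent step.

    slope_start / slope_end: directional derivatives of the loss along the
    update direction, before and after the step (slope_start < 0 for a
    sensible step).  Negative alpha: the loss is still falling at the new
    point (understepping).  ~0: we landed near the 1-D minimum.  +1: we
    jumped to the mirror point on the other side of the valley.
    """
    return slope_end / abs(slope_start)

print(local_step_alpha(-2.0, -2.0))  # barely moved, still going down -> -1
print(local_step_alpha(-2.0, 0.0))   # landed at the minimum -> 0
print(local_step_alpha(-2.0, 2.0))   # other side of a symmetric valley -> 1
```

On an exact quadratic valley these three cases reproduce the minus-one, zero, and plus-one readings described above; in practice the per-sample information is what makes the estimate trustworthy despite mini-batch noise.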
So this quantity can tell me exactly whether I'm understepping, overshooting, or doing something in between. The crucial point I want to make, again: the loss curve tells me whether I'm training or not, but it doesn't tell me why. Looking at the loss landscape would theoretically provide this information, but it's so high-dimensional — way too much information to make sense of. But if I can build quantities that compress this information into something meaningful, into a kind of status report, I can actually get the information I care about — which here is: should I take larger or smaller steps? So this is just one example of what we can do with the additional information, and the bigger point I want to make is that we could use it to build new debugging tools for deep learning. The current debugging tools for classical programming — by which I mean everything that doesn't involve machine learning or data — look a bit like this: we write functions, we write scripts, we write classes, and when they don't work the way we hope, we could theoretically look at the zeros and ones of our code, right?
All the information about why our functions behave the way they do would be visible in the zeros and ones, but that's way too much information — not something we can compress, understand, or really make use of as humans. So instead we use debuggers: they take this low-level information and condense it into a meaningful status report, which in the case of a traditional debugger means it shows you the value of each variable in a function, or, if you use a profiler, some runtimes and so on. And I think — we think — that we need something similar for deep learning. In deep learning we do model training, which is arguably a bit more complex than just writing functions, because we have data, because we build models, and because we have to worry about hyperparameters and so on. So we can get different bugs in model training than in classical programming. For example, I could simply mess up the learning rate of my training and get something that still compiles and runs — it doesn't throw any syntactic error — but my training might be super slow, or even just explode. So I can get these silent bugs that make my training very slow and inefficient, and maybe my final model very poorly performing, without being clearly visible.
And so we have different types of bugs compared to classical programming, which means we also need different tools to find them. If, in the case of deep learning or model training, I use a classical debugger, that actually looks a lot like staring at the zeros and ones from before: with a traditional debugger I could look at, say, the parameter values of my neural network in each iteration — all the information is there; I could look at the exact pixel values of all the images in my training data set — but this gives me no insight into what I should do. It's just a huge list of numbers that I can't make any sense of, just like the zeros and ones. What we need is something in between that takes this low-level information and compresses it into the relevant status report that I, as a user, want. This is something you could call a deep debugger, and with a tool we call Cockpit, I think we have a proof of concept that these kinds of tools can exist, that they can be efficient, and that they can actually dig up relevant information for the user. So what does Cockpit look like? If you train a neural network and look at the training process through its lens, training looks something like this. It's huge, and again it looks like you're being overwhelmed with information, but if we step through it one by one, it becomes much clearer. At the bottom you see the performance plot, which shows the training loss going down over time, and you see the accuracies — training and validation accuracy — increasing over time. By the way, this is just a looped animation: it's basically what you would see if you monitored the training live, and it restarts after it's done. This performance plot is probably what people look at all the time at the moment, right?
They stare at this — it's basically the loss curve plot. They stare at it, and if the accuracy goes up when the loss goes down, everything is good, and if not, you need to do something. And probably people also look at the current hyperparameters they're using; in this case we just plot the learning rate, which here follows a cyclical schedule. But now we want to extend the user's view beyond these two plots with all the additional plots at the top. This is the benefit of Cockpit: it augments your view with these additional instruments. There are actually more than these, but this is the default view you get. It's very similar to what a pilot in an airplane has: all these complicated instruments that tell him or her about the state of the airplane, instead of basically just looking out of the window and seeing if everything works fine. That's what we hope to achieve: augmenting the view of someone training a neural network with some interesting plots that tell them more about what is currently going on, similar to what happens in an airplane. In the blue column, we have instruments that tell you something about the step size — how large or small the steps are that you're currently taking. At the very top, for example, we have the alpha distribution that I just talked about. We also have plots about the distance: how far from the initialization am I currently, along my optimization trajectory? That's basically the difference between where I currently am and where I started, which can also tell me whether I'm perhaps stuck in some local area and need to move on somewhere else. There's also the gradient norm, which, as the name suggests, is just the norm of the gradient. It can tell me a lot about how much signal
I'm currently getting to direct my training process. One thing people often worry about when training neural networks is getting stuck in a local minimum; if we look at the gradient norm, we can actually check whether this is true or not, because it tells us whether we're stuck in a local minimum. Then there's an entire column, the orange one, just about gradients — especially about the fact that we have not only a mean gradient but also individual gradients we can learn something from. For example, these are basically three different quantities — in green, blue, and orange — our gradient tests, which test how much noise there currently is in the gradient you're getting. Basically, they tell you whether you should increase your batch size, decrease it, or whether it's fine. There are also histograms of how the gradient elements are distributed. And then there's an entire green column about curvature information, which I'm not going to talk about today, because Lucas's lecture next week will focus on curvature — he'll cover it in much more detail. What I want to show you with the next few slides is that these additional quantities at the top, and some variations of them that are also part of Cockpit, can not just overwhelm you with information, but can actually dig up relevant things and help you find bugs in your neural network training. So, this is an example of finding something we call a data bug: something that results in inefficient training and is caused by the way you handle your data. One thing you probably know is that every time you use data in a neural network — or in machine learning in general — you should normalize it, meaning it's roughly distributed between zero and one, or something like that.
But usually — not all the time, depending on how you set things up — your data arrives in a raw format, where pixel values are often between 0 and 255. Now if you look at the data, once in this raw format and once normalized — in this case just with matplotlib's imshow command — you don't really see a difference: visually they're the same, because matplotlib takes care of the scaling when it visualizes them. But your neural network will be affected by whether you use raw or normalized data. And you can actually see this if you look at Cockpit's gradient-element histogram for the normalized data versus the raw data: the raw data is spread out a lot more, and feels intuitively a bit more weird than this nicely behaved histogram here. Now, in general this doesn't always have to be a bug and a big problem, but it means that it probably affects your optimal hyperparameters, right?
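As a toy numpy illustration of why the scale matters (the linear-layer argument and the clean factor of 255 are a simplification — real networks mix in nonlinearities and many layers):

```python
import numpy as np

raw = np.random.default_rng(0).integers(0, 256, size=(32, 32)).astype(np.float64)
normalized = raw / 255.0             # the usual [0, 1] rescaling

# For a linear layer, the weight gradient is (backpropagated error) x (input),
# so raw pixels inflate gradient magnitudes by roughly the same factor of 255.
fake_error = 0.01                    # some made-up backpropagated error signal
grad_raw = fake_error * raw
grad_norm = fake_error * normalized

print(np.abs(grad_raw).mean() / np.abs(grad_norm).mean())  # ratio of ~255
```

A gradient scale that is ~255x larger effectively divides your usable learning rate by the same factor — which is exactly why a default learning rate tuned for normalized inputs can silently misbehave on raw ones.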
For example, your learning rate: if you hope that the default learning rate of Adam works, you probably want to use normalized data. But it's something that is easily missed, because even if you visually inspect your images, you might not see it. Another example is finding what we call a model bug, which again means something that negatively affects training, this time caused by your model. Here we see two networks, one in blue and one in orange. They seem to have pretty much the same gradient element histogram, but one of them, the orange one, is training much worse: blue trains nicely and orange trains pretty badly. So this might be a case where the gradient element histogram is not the full story. But what you can do with Cockpit, and again this is readily available additional information, is look at the same histogram for each layer; it's a big property of neural networks that they decompose into layers. So we can look at the same histogram layer by layer. This is the first layer, closest to the input; this one is the last layer, closest to the output; and we also show one in the middle. And what you can see is that for the orange network, a bunch of these layer-wise histograms are actually super degenerate, and only the last layer looks nice. But since most of the parameters of this network are in the last layer, that layer also dominates the network-wide histogram, which is why you cannot see the problem there; you have to look at it layer by layer. And with this you can already see that, okay, something might be wrong with my model, because I initially get some gradients, but they sort of degenerate throughout the model.
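This kind of layer-by-layer gradient inspection is easy to reproduce on a toy network. The following self-contained numpy sketch (my own toy construction, not Cockpit output) does a manual forward/backward pass through a deep bias-free MLP and reports the gradient norm of each layer's weights:

```python
import numpy as np

def layer_grad_norms(act, d_act, depth=8, width=32, seed=0):
    """Per-layer weight-gradient norms of a deep bias-free MLP, computed
    with a manual forward/backward pass for the toy loss L = sum(output)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    Ws = [rng.standard_normal((width, width)) / np.sqrt(width)
          for _ in range(depth)]
    acts, pre = [x], []
    for W in Ws:                        # forward pass, storing pre-activations
        pre.append(W @ acts[-1])
        acts.append(act(pre[-1]))
    delta = np.ones(width)              # dL/d(output) for L = sum(output)
    norms = [0.0] * depth
    for i in reversed(range(depth)):    # backward pass
        delta = delta * d_act(pre[i])                        # through the nonlinearity
        norms[i] = np.linalg.norm(np.outer(delta, acts[i]))  # ||dL/dW_i||
        delta = Ws[i].T @ delta                              # through the linear map
    return norms

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
relu = lambda z: np.maximum(z, 0.0)
d_relu = lambda z: (z > 0).astype(float)

sig = layer_grad_norms(sigmoid, d_sigmoid)
rel = layer_grad_norms(relu, d_relu)
# First-layer vs. last-layer gradient norm: tiny for sigmoid, healthy for ReLU.
print(sig[0] / sig[-1], rel[0] / rel[-1])
```

The sigmoid net's early layers get gradients that are orders of magnitude smaller than its last layer's, exactly the layer-wise degeneration described above, while the ReLU net's gradients stay at a comparable scale throughout.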
So probably there's something wrong with the model. And exactly, that's the case, because the difference between the blue and the orange network is the activation function: just by switching between sigmoid and ReLU, you can either get a network that trains really nicely or one that doesn't really train at all. And this is again something that is super hard to find if you don't have any tools for it, because then it's just trial and error. But if you at least have some idea of where the bug might be, and in this case it at least looks like something is wrong with the model, you can at least narrow down where not to look, let's say. And lastly, we have a case where Cockpit helps you with something that we call hyperparameter bugs, which essentially means you messed up the tuning, maybe you misspecified the learning rate. It's also a case where it can inspire new research and help us understand methods. So again, I'm looking at the alpha quantity from before. The only thing you need to remember is that a negative alpha means we're understepping, a positive alpha means we're overshooting or overstepping, and something around zero means we're making steps that are quite close to the optimal step in this direction. What we've plotted in this illustration above is the median alpha over a training run. So this is looking at all the alphas that we observed during a training run and just taking the median over them. We looked at different problems.
These are visualized by the different colors. So for now, let's for example only look at CIFAR-10 and this 3c3d model; 3c3d is just the codename of the model that we use. So for now, let's only look at these purple dots. We have multiple of those because we tested multiple different learning rates, and the larger the bubble, the larger the learning rate. So you can see this one has a somewhat larger learning rate, and the smaller learning rates end up here. And the higher the dots are on the y-axis, the better the performance, as measured by test accuracy here. And what you can see is perhaps a relationship that you would already have guessed: if we use larger learning rates, we also get a larger median alpha, because we tend to overstep more instead of understepping, that is, taking too small steps. But I think the surprising part about this plot is that the best-performing runs are not the ones that are closest to zero. So they're not the ones where we're minimizing the most in each step; they're actually runs where we overshoot. And this is really rather consistent between runs and between different problems, right?
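To make the alpha quantity a bit more concrete, here is a simplified step-quality statistic in that spirit. This is my own stripped-down version for illustration, not Cockpit's actual alpha computation, which among other things also accounts for noise and uses the slope at the new point.

```python
import numpy as np

def step_alpha(phi0, dphi0, phis, s):
    """Simplified local step-quality statistic (alpha-like).

    phi0, dphi0: loss and directional derivative at the start of the step;
    phis: loss after a step of length s along the update direction.
    Fits a parabola phi(t) = a*t**2 + b*t + c (assumed convex, a > 0) and
    reports s / t_star - 1, where t_star is the parabola's minimizer:
    negative => understepping, ~0 => near the 1D minimum, positive => overshooting.
    """
    c, b = phi0, dphi0
    a = (phis - c - b * s) / s**2
    t_star = -b / (2 * a)          # minimizer of the fitted parabola
    return s / t_star - 1.0

phi = lambda t: (t - 1.0) ** 2     # toy 1D loss with its minimum at t = 1
dphi0 = -2.0                       # derivative of phi at t = 0
print(step_alpha(phi(0.0), dphi0, phi(0.5), 0.5))   # -0.5: understepping
print(step_alpha(phi(0.0), dphi0, phi(2.0), 2.0))   #  1.0: overshooting
```

A step of exactly length one on this toy loss would land on the 1D minimum and give an alpha of zero; shorter and longer steps come out negative and positive, matching the sign convention used in the plot.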
We can see that the best-performing runs, which are always highlighted with this vertical line, all fall in this area of positive alpha, which means overshooting; and even all the merely good ones are in the right part of the plot. So clearly, minimizing in each step is not what we should do; we actually need to overshoot somewhat. And so this is something where, if you train your neural network and you observe negative alpha values, you should probably increase your learning rate. As you can see, the best alpha value depends on the problem, so I cannot tell you exactly by how much, unfortunately, but at least a negative alpha value is probably a sign that you're taking too small steps. Besides helping the user tune their learning rate, this is also really interesting because it tells us that, again, in neural network training, maybe minimizing is not what we actually want to do, because apparently we need to overshoot, and we need to do so systematically to get the best performance. And this is not something that only we observed; other people observed it as well. For example, there's a paper from the University of Toronto where they basically looked at a very simple problem. Here's the quadratic problem again: we have these contour lines here and the starting point here, and they compared two different strategies. And they compared them in the setting where you can observe the gradient only with noise, right?
And that's a very critical aspect. So they first compare the method where the steps you take are always designed such that the step size is the one that gets you the lowest loss possible in this direction. So if we start here and the gradient points us in this direction, the best step we can take lands us here, because this gives us the lowest loss. Now, if we were in the deterministic setting, my next gradient would point directly at the minimum, and the best step I could take would land exactly there; so after two steps I would be done. But unfortunately, in neural network training we're not in the deterministic setting, which means everything is a bit noisy, which means the gradient I actually get is not pointing exactly in this direction, but probably slightly up or down. And if I then pick the locally optimal step size along that noisy direction, it's actually a really small step. So if I do this repeatedly, I will just wiggle around in this narrow valley at the bottom and not make a lot of progress. Then they compared this to a globally optimal schedule: one where in the beginning you take large steps and bounce back and forth across this valley, and only after a while decay your learning rate until you settle into the valley, and then you end up a lot closer to the minimum than with this short-horizon-optimized schedule. So again, this is an example of why optimization and training are two different things: in the deterministic setting, we could just take locally optimal steps and we would end up with something that is globally optimal. But in this case, we have what they term a short-horizon bias, in the sense that taking something that is locally very good might actually hurt my performance in the long run. Instead, I should do something like these steps here, which look very chaotic in the beginning and
very problematic perhaps, but they actually make a lot of progress in this direction here, so that when I decay the learning rate, I end up at a very low and good point. And so this also gives some background on why this overshooting is perhaps necessary to train a network: because we have this noise, we need to compensate for it somehow.

All right, so I'm at the end of my lecture, and I briefly want to summarize the main points that you can hopefully take from it. As I said before, in most of the previous lectures we always had: this is the classic method that people look at and use, and here are some ways to improve it. But in deep learning, it's super unclear what the state of the art is. If you ask two people who train neural networks how they do it, you'll probably get at least ten different answers, right? So there isn't a clear protocol, a clear training recipe that you just follow to train your neural network; instead there are hundreds of them, and people have all these strategies and conceptions of what should be done and how. So there is no established standard protocol to follow, and instead a lot of babysitting and tuning is still necessary to get your neural network to train. The second point I want to make is that we have all these methods, but they are clearly unsatisfying to use at the moment, which is why training a neural network is such a pain. And the third point is that the stochasticity in deep learning is a primary source of the challenge in this hard task of training a neural network; it is what makes training and optimizing two different things, and we want to train a neural network, not just simply optimize. But luckily, we think that with the additional information that we can get from something like the backward pass
efficiently, and by thinking about the gradient not only as a single value but as a distribution, with the standard deviations, the variances, the confidences that come with it, we can actually account for the stochasticity: first to maybe build some better tools for practitioners, but hopefully in the long run also to build better methods that can take a lot of this burden and babysitting off of people, more autonomous methods that can do a lot of this on their own. So that's our two-way strategy: first think about the tools that we can provide for practitioners, while we work on developing these better methods that do a lot more of this autonomously. So that's the high-level summary. Hopefully after this lecture you agree with me that currently training a neural network is really a mess. And if you're interested in helping us improve this, then let us know; we're always looking for master's students, bachelor's students, or PhD students to help us improve this. So thanks a lot.