So, let us talk a little bit about training sets and test sets, and about evaluation. The reason isn't that you haven't seen this before in machine learning; you undoubtedly have. But bad cross-validation and bad evaluation are so endemic in machine learning and deep learning that it's really important to talk about them here.

First: overfitting is real in deep learning. You may think, oh, but my data set is big. But don't forget about the number of parameters you have. In deep learning we are almost never in a regime where we should not expect overfitting. In fact, when you plot performance on the training set and the test set during training, you are always doing much better on the training set. And we should expect that, because we simply have so many free parameters.

So how do you know that your deep learning system actually works? We use cross-validation for that. You always want to have a validation set. What we often do elsewhere is k-fold cross-validation: we divide the data into k pieces, train on k minus 1 of them, and test on the last one. We don't do that all that much in deep learning, simply because the extra compute cost is too large. But what we can always do, and should do with the kinds of data that we have, is keep a training set and a test set.

Now, a word of warning. In general, we do hyperparameter optimization. We do it explicitly, where we might search over, say, the number of layers that we use; in fact, you'll be doing that later on today. But we also do it implicitly: we try a ResNet, then we try another neural network, then we try Inception, and we choose whichever is better. That is technically hyperparameter optimization too. And here's the huge problem. If I take a training set and a testing set and put the testing set aside, I cannot then use the test set to figure out which algorithm I should use. Doing so anyway is what I would call leakage by PhD student: you look at your results on the test set, and then you innovate on the algorithm until it does well on the test set. If you want to do hyperparameter optimization, you should always do it on a separate validation set. So you take your training set, divide it into pieces, and do hyperparameter selection on those pieces.
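To make that split concrete, here is a minimal sketch in Python. It uses toy data and a small scikit-learn classifier as stand-ins for whatever deep network and hyperparameters you are actually tuning; the point is only which part of the data each step is allowed to touch.

```python
# Minimal sketch: hold out a test set, then tune hyperparameters on a separate
# validation split carved out of the remaining data. Toy data and a decision tree
# stand in for whatever model you actually use.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# 1) Set the test set aside first and do not look at it again until the very end.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) Split the remaining data again; the validation set is where hyperparameters
#    (tree depth here; layers, learning rate, or "which architecture" in deep
#    learning) get compared.
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

best_depth, best_score = None, -np.inf
for depth in (2, 4, 8, 16):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)   # model selection uses the validation set only
    if score > best_score:
        best_depth, best_score = depth, score

# 3) Refit the chosen configuration and touch the test set exactly once.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_dev, y_dev)
print(f"chosen depth={best_depth}, test accuracy={final.score(X_test, y_test):.3f}")
```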
The other important thing is that you need to match your validation strategy to the use case, and I want to share some of my experience from one area of machine learning. In medical machine learning, data is really expensive. Like, really expensive. For every subject you bring in, someone needs to call them, someone needs to make sure they come in, lab work needs to be done, and so on and so forth. So it's very expensive to get the data, and it turns out that the amount of money NIH gives people is usually too small to run meaningful machine learning. Just as a reminder, meaningful machine learning generally means we need at least hundreds, if not thousands, of data points, but let's say at least a hundred. Bringing in a hundred people is really, really expensive.

The right way of doing cross-validation matches the medical use case. Say I want to figure out how you feel based on your phone. The use case is: I recruit a hundred people, all kinds of volunteers, I collect the data from their phones, and I then build a system for clinical rollout. When I do the rollout on the clinical side, I will not have the ability to retrain on the new person; I will only have trained on other people. So to match this situation, what I always want to do is train on some people and test on new people that I haven't seen before. This is what's called subject-wise cross-validation. However, it's very expensive, because we need a hundred-some people.

So what scientists in the field often do instead is record-wise cross-validation: take the recordings from all K subjects, use a random subset of the recordings for training, and predict the recordings that were held out. That means I might take ten recordings from Conrad and use nine of them, plus recordings from other people, to predict what happened in the remaining recording from Conrad. When it comes to clinical questions like "does Conrad carry a certain disease," every one of Conrad's recordings carries the same label. So if the data contains anything that identifies who Conrad is, I can seemingly diagnose any disease he may or may not have. That gives you falsely positive results even when it is impossible to diagnose the disease from the data at all, because it certainly is possible to identify Conrad: he has certain ways of talking, certain ways of walking, and we can use all those features. So record-wise cross-validation is simply wrong here, because it doesn't match the use case we would ultimately have. You want to be really mindful of the evaluation strategy. When it comes to clinical applications of machine learning, there is little doubt that the vast majority of papers falls short of providing something meaningful. The university press release always says, yay, another important problem solved; in practice it doesn't seem to work, and part of the reason is bad cross-validation strategies.

At the same time, you also need a good comparison metric. What do I mean by that? Here's another example. Many people do mood estimation on mobile phones, as I mentioned, and they typically report an r-squared of 0.7. So they fit to Conrad's data and say, Conrad, you'll be a 7 out of 10, and I'll say, that's awesome, I actually do feel like a 7 out of 10 right now. It turns out I kind of always feel like a 7 out of 10, and everyone has their own baseline. Now, what definition of r-squared do they use? They take 1 minus the ratio of the model's error variance, basically how wrong the model is for each prediction, to the total variance that we see in reality. And the models in those studies are fit specifically to each user. So what does the trivial model look like here? It says: everyone has their own baseline level of happiness and will stay close to it. For Conrad, my prediction would be 0.7, 0.7, 0.7, 0.7; I predict that Conrad is a 7 out of 10 every single time, and someone else might be a 5 out of 10 every day. This model is entirely trivial. It involves no machine learning at all, if you think about it, and yet it gets an r-squared of about 0.7. Why is that? Because the differences between people are larger than the variation within a person across time. So here we have an outcome metric that makes it look like machine learning is almost magic. An r-squared of 0.7! Why would I even ask people how they feel today if machine learning can do it at 0.7? Well, there is a reason: that number is close to useless, because it is mostly just reporting each person's average. And what we found in a large meta-analysis in that area is that quite a lot of published machine learning is not better than guessing a person's average.
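Here is a minimal sketch of the difference between the two splitting strategies, assuming scikit-learn is available. The synthetic "recordings" deliberately carry a subject-specific signature, and the label is constant per subject and otherwise unrelated to the features, so the record-wise score comes out inflated while the subject-wise score stays at chance.

```python
# Sketch: subject-wise vs record-wise cross-validation on synthetic "recordings".
# Each subject contributes several recordings; the features carry a subject-specific
# signature, and the label (think: diagnosis) is constant per subject and otherwise
# unrelated to the features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_subjects, recs_per_subject = 100, 10
subjects = np.repeat(np.arange(n_subjects), recs_per_subject)     # who each recording is from
y = np.repeat(rng.integers(0, 2, n_subjects), recs_per_subject)   # one random label per subject
signature = rng.normal(size=(n_subjects, 10))                     # identifies the subject
X = signature[subjects] + rng.normal(scale=0.3, size=(len(subjects), 10))

clf = KNeighborsClassifier(n_neighbors=3)

# Record-wise: recordings from the same subject land in both train and test folds.
record_wise = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Subject-wise: GroupKFold keeps each subject entirely in train or entirely in test.
subject_wise = cross_val_score(clf, X, y, groups=subjects, cv=GroupKFold(n_splits=5))

print(f"record-wise accuracy:  {record_wise.mean():.2f}  (inflated: the model just recognizes the subject)")
print(f"subject-wise accuracy: {subject_wise.mean():.2f}  (chance level, which is the honest answer here)")
```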
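And here is a small numerical sketch of that r-squared trap: the "model" only ever predicts each person's own average mood, yet evaluated against the pooled variance it scores a high r-squared whenever people differ from each other more than they vary from day to day. All numbers are made up for illustration.

```python
# Sketch: a trivial per-person baseline can score a high r-squared when
# between-person differences dominate within-person variation over time.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_days = 50, 30
person_baseline = rng.uniform(3, 9, n_people)           # e.g. Conrad is "always a 7"
mood = person_baseline[:, None] + rng.normal(scale=0.8, size=(n_people, n_days))

# "Model": just predict each person's own average mood on every day.
prediction = mood.mean(axis=1, keepdims=True) * np.ones_like(mood)

ss_res = np.sum((mood - prediction) ** 2)                # residual sum of squares of the "model"
ss_tot = np.sum((mood - mood.mean()) ** 2)               # total sum of squares around the grand mean
r_squared = 1 - ss_res / ss_tot
print(f"r-squared of the no-machine-learning baseline: {r_squared:.2f}")
# Prints roughly 0.8 here: the metric rewards knowing *who* you are, not predicting
# *how you feel today*. Compare against the per-person baseline instead.
```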
I also want to mention how big the effect of overfitting is. ImageNet is one of these really big data sets, with countless images, that all kinds of people use. In 2019, Ben Recht and colleagues did a neat study where they re-crawled the web and collected new test images that were just like ImageNet, using the same collection mechanism. What they found is that on that new test set there was a significant drop relative to the performance of the same algorithms on the original ImageNet. So ImageNet, in a way, doesn't generalize all that well even to ImageNet. What they also found, however, is that the algorithms that are best on the original ImageNet are also the best on the new one. So overfitting is real, but here it seems to be similar across different algorithms. I just want you to be aware of that: your algorithms will, under all circumstances, massively overfit, so you need to be super mindful about it. Make sure you always have a train and a test set; this is possibly the main failure point in machine learning. So now that you have split your data into a train and a test set, it's time to train.