Okay, good afternoon everybody. Today we are cautiously trying a second microphone for the room. Hopefully everything works out so I don't have to yell into the camera's microphone; if there's a problem, we'll go back to the old way. So today we have one more topic to cover before we jump into the actual material, and hopefully next week we can go into the material. It's a very important topic: how to validate AI algorithms. What you are really asking is, how do I use the Turing test in practice? That is basically what we are doing. We want to go beyond the theoretical interest and the historical significance of Alan Turing. He asked the question, can machines think, which clearly they cannot. And what Turing suggested was: yes, but they can imitate, and the degree to which they can imitate would be proportional to a machine intelligence quotient, which you can measure by how long the machine can fool a human being. But I still cannot really run that as a test. How do we do it in the practice of AI algorithms? We usually get the data and then we train our algorithms, methods, whatever you want to call them. I want to use "algorithm" a little loosely here, because a neural network is hardly an algorithm; it is not really a predefined set of steps. Then the first question arises after you train it: how do we know that it has learned what it was supposed to learn? You get a bunch of data, you select an algorithm, and at the moment we have really sparse knowledge to guide that choice; how do I know which algorithm is good for my application? Then I do some filtering and some dimensionality reduction, and I feed it into some sort of AI agent. And how do I know it's good enough? How do I know it learned? Because I want to give it to somebody. People want to use it, people will rely on it. What if my AI agent is the full autopilot of an A380? 500 people are sitting on board and flying, so I want to know that it has learned. So how do we do that? Well, I would say we should test it. Of course, we test everything before we release it, and if it is good enough, then we release it. Well, what is good enough? We cannot be vague about this; it has to be clear. We have to quantify it, we have to document it. Even for a simple thing like putting your website somewhere, the hosting company has to tell you: my server has an uptime of 99.9%. They don't even give you 100%, because, well, there can be a power outage. But other than that, pretty much 100%. And if it cannot process certain types of inputs, you also have to convey that to the final user. So how do we make sure of this? How do I make sure that my AI agent, if I use that phrase, that this piece of AI software is good enough to leave the laboratory and go into the real world, to do serious work for different kinds of people, in finance, in robotics, in medicine, in logistics, anything? How do I do that? Well, that's what we want to talk about, because starting next week we should be able to do this, once we learn the explicit methods that bring a little bit of intelligence into the way we do things. So the first ingredient in making sure things are good enough is to have a target function, a target or objective function.
I did not use the word objective, or the phrase objective function alone, because if I did, people would say, oh, objective functions are reserved for optimization, and AI is not optimization. Well, AI is optimization in a non-conventional way. So, target slash objective function. What do we have? The most common one is error, and if I have error, of course I want to minimize it. This is a big chunk of the techniques that we have: we calculate the error for them, and then we have to minimize the error. That is the way we make sure the model is good enough for practice. Then there are techniques that use some sort of reward, and if it is a reward, I have to maximize it: every time you do something okay or good, you get a reward. Sometimes it's about fitness, and of course we want fit solutions, so again we maximize. Sometimes we get punishment, and of course we want to minimize it; we don't want to get punished most of the time. And there are more. Perhaps the most commonly used target function for us, if you look at neural networks, is error: you calculate the error, which is the difference between the desired output and what you actually compute. When we get to reinforcement agents, we use reward and punishment. When we get to evolutionary concepts, we use fitness. And there are others for other schemes. Okay, now we are usually given the entire data that we have: capital X, the set of all observations x_t, where t goes from 1 to n, however much data you have. Usually what you have are your inputs, which could also be your features. Either your inputs are directly your features, or you have some raw numbers and you calculate features from them. Features are representations of the data. Sometimes the data is just too raw to be processed by AI, so you have to compute something more expressive to give to the AI agent. Now, this is your input, and you may or may not also get an output, which in this case is the desired output. Again, this may be one row in that gigantic table we get from somebody to start a project. If we only get the inputs, we will be doing unsupervised learning; if we also get these outputs, we will be doing supervised learning. And just for emphasis: this is the desired output, not just an output. For this measurement, I want to have this output. One of the critiques you hear: somebody sent me a video interview with Vladimir Vapnik, the inventor of the SVM, and he, like many others, is skeptical of deep learning, as we have to be of any method. There is no holy grail. This is science; we question everything. We got here because we question everything. The point is that actual intelligence, pure intelligence, has to be unsupervised. The human brain can detect and find patterns in a very small amount of data, unsupervised. We don't know what the outcome should be, but we just grasp it. But as long as Homo sapiens and Homo erectus and Homo habilis are involved, that is at least 10 million years of evolution after we separated from the apes, and how much evolution went into them before that, maybe 100 million. So it's not easy to imitate that, Mr. Turing. But we try. We're at the very beginning.
So, given the set of all hypotheses, capital H, which is the set of all possible solutions, find a small h coming from capital H — find one hypothesis, at least one — such that the sum of the differences between the outputs I calculate for my input vectors x_t and the desired outputs for those x_t, coming from the superset X, is small and goes towards zero. This x_t^d is the desired output. Usually we get one, but it could be a small vector. Usually the vector that contains the inputs and features is very long, and then you get one or two or three desired outputs, which are usually classes: for this measurement it belongs to class A, for that measurement it belongs to class B, whatever. So, another disappointing formulation or revelation about AI: this is AI. This is what we do. We minimize the error, the difference between the output that I calculate and the output that I measure. So how do we do this? Well, first of all, this is supervised. Maybe I am being a little unclear about this, so let me mark it, call it star: this is the output that you get using x_t, and this is the output that you should get because of x_t, just to make it clear — the output you calculate for x_t and the output you should get for x_t. This is supervised. This is what we do, and that's okay; there is no shame in it. It's supervised, and basically everything that is out there right now and successful is supervised. What people don't talk about is that this is very expensive. The x_t^d is very expensive. You need people to give you x_t^d; somebody has to tell you what the output should be. There is a lot of intelligence in x_t^d, and if you don't have it, there is no supervised learning. So this constitutes a good fit of the model h to X: one hypothesis h from the many, many hypotheses in capital H that fits my data. If the problem is big, chances are there are many solutions, not just one. And again, if you find one of them, you're usually happy; it doesn't matter which one. If there are 100 solutions distributed in this room and I want one of them, that's okay; I don't want to search for all of them, I just need one. Okay. Is there any concern with this concept? I get data, and for the data I calculate an output. Then I compare my output with the desired output, and if there is a difference, I have to make sure that next time my difference decreases, because I will guess the output again. That's fitting. That's regression. This is 60, 70 years old. Yes, that's linear regression you have in mind; we are talking about nonlinear regression, where I don't know the f(x), and f(x) is very, very difficult. So what types of scenarios do we get in the real world? Scenario number one: you have your data, and you have a solution, and the proportions here are intentional. Then you train. What would happen in such a scenario? The data is gigantic — I am basically visualizing the complexity of the data with the size of this irregular shape around X. So you have complex data and you have a simple solution, whatever simple means: it means one thing for a rule-based decision tree, something else for a reinforcement agent, and something else again for a neural network. If you do that, you will realize that it does not converge. It does not converge means you cannot push the error residual towards zero. Not going to happen. You get stuck at 5 milli when I want to go towards zero.
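Just to pin down that formulation in symbols — this is my own sketch of the board notation, writing h(x_t) for the calculated output and x_t^d for the desired output:

```latex
% Sketch of the supervised error-minimization formulation described above.
% Given data X = {(x_t, x_t^d)}, t = 1..n, and the hypothesis space H,
% find at least one hypothesis h whose total error goes towards zero:
\[
  \text{find } h \in H \quad \text{such that} \quad
  E(h \mid X) \;=\; \sum_{t=1}^{n} \bigl\lVert\, h(x_t) - x_t^{d} \,\bigr\rVert
  \;\longrightarrow\; 0
\]
```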
So convergence means the difference between the desired output and the calculated output goes towards zero. That's convergence: you converge to the desired output. But if you have big data — not just the amount of it, the complexity of it — and you have a simple solution, that may not happen. Which means what? The problem is small, linear, stationary... oh, I'm writing the next scenario. Nobody screamed. That doesn't match what I'm describing — another disadvantage of having lectures after five. So: the problem is big, non-linear, and non-stationary, any bad word that you can imagine in data science. Big means big: you have 5,000 features. It's not about how many data points you have, it's about the complexity of the feature space. Non-linear means the relationship between the inputs and the output cannot be described with a line, of course not, or a plane. And non-stationary — it doesn't have to be all of these at the same time, it can be some of them — non-stationary means that the position of the solution changes over time. If I want to use reinforcement learning to control the elevators of this building, the optimal solution is sometimes on floor number 2, sometimes on floor 5, sometimes on floor 7. The solution changes; it does not have a station. It's not x squared plus 2 equals zero, where you can solve it and say x is this — that's a stationary problem. Non-stationary problems are the nightmare of engineers, because you have to bring in some continuous adjustment. Here the solution, which we call the hypothesis, is not capable of capturing the complexity. You don't even get to test this. There is no Turing test necessary here, because you will not get a solution. It just goes and goes and goes, because you said: go until epsilon is almost zero. We never say zero — that's too idealistic — we say, go until it is 0.001. And you see iteration number 2 million, and your error is still at 120,000, staying like a constant line. It's not changing, it's not going down anymore. You're not converging, and when you don't converge, you don't have a solution to test. That's actually one of my favorite problems, because then I know: okay, my model has a problem, so let's make it bigger. What does bigger mean? Add more layers. Add more branching. Add more states. Add more chromosomes. Whatever your solution is. Okay, what is the second scenario? Scenario number 2: I get my data, I have a solution — this is nowadays much more often the case — and then you train it. And you are really happy, because you get fast convergence. You run it in MATLAB on your laptop, and boom, after 20 seconds it gives you the result. Chances are that you are getting a crappy solution. When you get a solution that fast, it's almost certainly useless, unless you have a very easy problem. And if you have a very easy problem, what are you doing with AI? Well, I want to impress people. Okay, as long as they don't have a degree, you may be able to do it. So here the problem is small, linear, and stationary, and the solution is too big, such that it completely owns the problem — to use a non-AI word for it, because there is a proper word for it in AI. We don't say it owns the data; it basically swallows the data. You have a gigantic solution and a small problem. Of course you run it, and you converge.
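To make that stopping rule concrete — go until the error is below some small epsilon, or give up after a maximum number of iterations — here is a minimal sketch. The function name train_step, the tolerance, and the iteration cap are placeholders I am assuming for illustration, not anything from the lecture.

```python
# Sketch of a convergence check: stop once the error residual drops below a
# small epsilon, or give up after a maximum number of iterations, in which
# case the model has not converged (for example, it is too simple for the problem).
def train_until_converged(train_step, epsilon=1e-3, max_iter=2_000_000):
    """train_step() is assumed to do one update and return the current error."""
    error = float("inf")
    for iteration in range(1, max_iter + 1):
        error = train_step()
        if error <= epsilon:              # converged: error pushed towards zero
            return True, iteration, error
    return False, max_iter, error         # still far from zero: no convergence
```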
You push the epsilon almost to zero, boom, within 10 seconds, or even 10 minutes, or even an hour. You converge fast. It's just too good to be true. When you get a solution so fast, something is wrong. Please double-check everything. Go back. What is the problem again? How many features do you have? How complicated is it? Has somebody already solved it with just a linear regression, reporting 92% accuracy, done, while you are bringing in all this AI? So here we say that h is memorizing X. That is what has happened: you have just too much memory, you simply swallow the problem, and during training you get 99% accuracy. It just cannot happen. If it is a difficult problem, it cannot happen that in the first attempts you get high accuracy. Cannot be. There is no theory for that, no theorem that we can prove, but empirically we know that you have to suffer for weeks to get to 85%. You cannot easily get to 80%. When you are in the 90s — oh, wait a minute. Well, if it is six months into the project and you are close to the deadline, maybe the 90% is actually something real. Okay, so we decided to stay in the middle. Good. What does that mean? Now you have to figure out how to find out, because you could have a problem with 10 features that is extremely hard, and you could have a problem with 5,000 features that is easy — because you did not apply PCA, most of them are redundant, and there are actually two features. So then, Mr. Turing, what is the ultimate sign that the algorithm has learned? How do I know? Well, the ultimate sign is very simple: it can generalize the inherent X-Y relationship to unseen data. This is what several generations of AI researchers have figured out as the practical implementation of the Turing test. How do you know that your algorithm has learned? If you say it converged — convergence is a really tricky word. Maybe I converged fast and my algorithm was nice, I was lucky to have the best design, who knows. But we know that it has learned if it can generalize to unseen data — data that was not part of the training set. If it has really learned that unknown, magical, mystical f(x) that nobody has seen, then it can generalize that relationship to any data. So the keywords for us are generalization and unseen. From time to time I encounter people, especially at conferences, usually master's students — PhD students don't make that mistake anymore — who come and say, oh, I got 99%. And you see that they have used the training data for testing. And you don't want to kill their enthusiasm and say, you know, I think you should not use this. So we want to validate, and the validation is fundamentally a test for generalization. So we are getting close to being really explicit about running Turing tests on our trained AI algorithms. The core idea is: keep one part of the data for testing. Is that it? I keep part of the data for testing. Wow. A puzzle is really easy after somebody has solved it. People worked for several decades to find out: oh, okay, take the data and keep a part of it, don't use it for training, and then test with it. But is it really that simple? Well, maybe not. So if this is your data, if this shape represents your data, most of the time we break it down around 70%/30%. These numbers are not magical; you can have 80%/20%, 90%/10%. Usually this part is larger because we use it for training.
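In code, a minimal sketch of this hold-out split, using scikit-learn; the data here is just a random placeholder, and the exact 70/30 ratio is only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: n samples, d features, and a desired output (class label).
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# Keep 30% completely aside for testing; shuffle so we do not depend on any
# ordering or bias in how the data was collected.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)   # (700, 20) (300, 20)
```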
And we use the smaller part for testing. So: a large part of the data to train, a smaller part to test. Again, there is no magical threshold, but naturally you need more data for training because you want to figure things out, so that part has to be more than 50%. But there are other tricks we can use within the training. Because here you will train one hypothesis, one model, one solution — and who says this is the best solution? In one iteration, one epoch, one stage, one phase, you train something and you converge to 1.2. Is that the ultimate measure of generalization? I'm afraid not. So what happens if I train a second time, again from scratch, and this time I get a slightly higher error but possibly a better solution? I may have a slightly higher error and yet a better solution. It could be. So how do I know? Well, you can keep a part of the data that is reserved for training and do validation there. People mix this up — newcomers to AI mix this up. Which means what? Now you take the 70% that you held for training and divide it again into a big part and a small part. Let's say from the 70% you take 90% for training and 10% for validation. Which means: I train, I validate; I train, I validate; I train, I validate. I get many, many hypotheses, but I choose the one that, through validation, gives me the best result. Then I use the testing data to give you my final number. So I have to give you two numbers: training accuracy and testing accuracy. But in order to be able to generalize, keep in mind what is inside and what is outside: inside the company versus outside the company, inside your lab versus outside your lab. Once you hand it over out there, you don't have any control anymore; it's too late. We simulate that by keeping a little bit of the data for testing, because I want to have that last piece of assurance that my solution is good. So I validate to choose the best hypothesis, and then I do the actual test. Yes — if your data is small, it will be biased, but depending on how you do this, you can get rid of the bias. In training you will find h1 and h2; you can come up with n solutions, h_n. Which one of them is the solution? Almost 100% of AI techniques start from a random configuration, so you can start here and it takes you five minutes to get to the target, or you start there and it takes two minutes, but the quality of the target will be different. So validation is part of training: before I release a model, I validate, because the testing number is my official accuracy and I want it to be the best. This is unseen data, and whatever accuracy I get on unseen data is officially my accuracy, so I want to do my best. However, this may not be reliable. The split may be lucky or unfortunate. You do 70%/30%, 90%/10%, and you may be lucky: you grab the difficult part for training and the easy part for testing, just by luck. And then you release it to people, people test it with real-world data, and it collapses. Or you may be unfortunate and make the difficult split: your training data is really, really easy, but the testing data is very difficult, containing all that non-linearity and non-stationarity and nasty complexity. So you are telling me this is basically what we do to implement the Turing test, but then you are telling me it's not reliable.
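Before moving on, here is what the train/validate/test procedure described a moment ago could look like, continuing the previous sketch (so X_train, X_test, y_train, y_test come from the hold-out split above). The candidate models and their sizes are arbitrary choices for illustration, not a recommendation:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Split the training portion again: 90% to fit, 10% to validate.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.10, shuffle=True, random_state=0
)

# Train several candidate hypotheses h1, h2, h3 and keep the one that does
# best on the validation part; only that one gets to see the test set.
candidates = [
    MLPClassifier(hidden_layer_sizes=(h,), max_iter=500, random_state=i)
    for i, h in enumerate([5, 20, 50])
]
best_h = max(candidates, key=lambda m: m.fit(X_fit, y_fit).score(X_val, y_val))

print("validation accuracy:", best_h.score(X_val, y_val))
print("official (test) accuracy:", best_h.score(X_test, y_test))
```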
Yes, and whenever something like that happens, we do a lot more of them instead of just one. We have to do this many, many times. So we do k-fold partitioning. You get the data and you grab this 30%; you get the data and you grab that 30%; you get the data and you grab another 30%. That would be a three-fold. And for each fold, you do the whole thing: grab one of them, 30%, put it aside, do training and validation, training and validation, training and validation, then test. You get one number. Second time, third time — you get three testing accuracies. Now I know it's not about luck, it's not about randomness. First of all, if we do things properly, those three numbers should be very close. The god of randomness is extremely fair, very fair; the only condition is that you have to do it a lot, otherwise you don't have the probability to back it up. And talking about the god of randomness, we do random sampling, of course. If it is a real problem, nobody goes and takes rows 1 to 200 as the 30% — nobody does that. Don't just grab the first 30%. The data may be ordered, or ordered in reverse; the people who collected the data may have some bias and preferences and put certain things at the beginning. So we do random sampling and take that as the first 30%. Then we do random sampling again for another fold, and again for another 30%. We have to do a lot of random stuff to make sure we are not relying on a bias in the data and not experimenting based on our own bias. Yes — so when we get feedback from unseen data, and the data is really unseen, then depending on some other parameters this is real feedback about the generality of our solution. We are assuming that you have enough data, that the data is complete, that everything we are talking about is in order, that you are doing random sampling, that there is no bias in the data. Then yes, whatever unseen data tells us is real — unless you grab only 1% of the data as unseen; then it cannot be representative. And you don't do just one fold; you do the k folds. That level of experimentation is what gives us comfort. Yes — they may have, but we don't go and correct it, because otherwise it's not random anymore. We just go random, and of course you then have to rely on a good random number generator. Yes. So for each fold, again, we take the 70% for training and, say, 30% for testing, and the 70% is split again inside. Each time is a separate experiment: I train with this and test, and the first one tells me 92% accuracy on unseen data. What was in the training? We forget about it — you can report that you had 95% accuracy in training, you can do that in an academic paper, but the unseen number is the important one. We do it for the second fold and get 91%, for the third and get 89%. And they are relatively close. Okay, that's a piece of comfort; I seem to be doing okay. So what? All of this works if we have a lot of data. What if we don't? If we don't have enough data, we have other problems — how are you training anything without enough data? But you may have enough data to train something and still not enough to do the k-fold cross-validation. For high-dimensional stuff that happens a lot. 1,000 measurements is not enough for AI in many applications, but it may be enough if your data points are videos.
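Before the small-data case, here is what the k-fold procedure itself could look like on the same placeholder data from the earlier sketch — a 3-fold here, shuffled so the folds are random samples; what we look at is how close the per-fold test accuracies are to each other:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# 3-fold cross-validation: three different random 'unseen' portions,
# three testing accuracies. If everything is in order, they should be close.
model = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds)

print("per-fold test accuracy:", scores)
print("mean: %.3f  std: %.3f" % (scores.mean(), scores.std()))
```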
Having 1,000 videos — oh, that's more than enough. It's high dimensional; I can get a lot more information out of it. But 1,000 samples to split like this is not really enough for k-fold, unless you do, I don't know, 10-fold, 20-fold, 50-fold. Most experiments that I have seen are around or below 10 folds: 3-fold, 5-fold, 10-fold are quite common, provided you have done the rest correctly, with no big mistakes in the other parts. So if that happens, we don't do k-fold cross-validation. This, by the way, is k-fold cross-validation — the standard way of testing anything in AI. Yes — we randomly sample, and then, just for convenience of calculation, we push the samples into a certain order, or you can keep the indices; it doesn't matter. And A and B are not the same, and C and D are not the same, because they are the result of different random processes. So if we don't have enough data, we will not be able to use k-fold; we will use leave-one-out. Leave-one-out validation: if you have 1,000 videos, you cannot say, okay, I use 700 for training and 300 for testing. Even 10-fold may not give you much, because the data is complicated. So the type of validation is also related to how difficult the problem is that you are trying to solve. If I get 1,000 videos, I will do leave-one-out, which means I train with 999 videos and test with one video. Then I repeat: grab another video, train with the remaining 999, and test again. So I have 1,000 folds; I do that 1,000 times. In general, nobody can afford 1,000-fold cross-validation, and we don't do it — why should I? I rely on randomness: 3-fold, 5-fold, 10-fold, good enough. I'm good, I'm secure — empirically, not theoretically, but empirically. But if I don't have enough data, leave-one-out is our best bet. Good. So if you have n data samples, we use 1 for testing and n minus 1 for training, where n is basically the cardinality of your set, how many samples you have. So it is n-fold cross-validation. Very expensive. If I could, I would always do leave-one-out, but I cannot do it when I have 1 million data points — you want to test 1 million times? Who can afford that? It's a purely computational limitation. Also, with everything we know generally from convergence theory, if you do 10 folds you see the pattern; you don't need a million folds. The leave-one-out category historically comes in a big way from the clinical setting, because you do a study on, let's say, 50 patients in a hospital; you can never get a million people. You get 20 patients, 50 volunteers — how much experimenting can you do with 50? So we leave one out: put one patient aside, train with the 49, test on the one we left out, and then repeat. Very clever. Very smart to leave one out. It can easily boost the confidence in the AI, because, again, what do you have here? You grab this one, if this is just one sample, and then you grab this one, and this one, and so on; you do that n times. Very expensive, hence suitable for small data, not a good idea for big data. If I have big data, again, random sampling is our best friend. The only thing that can go wrong is using a bad random number generator, which, if I'm using a major programming language and the standard packages that everybody has used — if I'm using the random package of Python 3.2 — should not be the case. I'm fine; I don't need to worry about whether the random number generator is reliable or not.
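For the small-data case, leave-one-out can be written with the same tools as before; in this sketch I pretend only 50 samples of the placeholder data are available, which is the regime where this is actually affordable. The classifier chosen is arbitrary.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Leave-one-out: n folds, each time training on n-1 samples and testing on
# the single sample that was left out. Only practical for small data sets.
small_X, small_y = X[:50], y[:50]            # pretend we only have 50 samples
model = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, small_X, small_y, cv=LeaveOneOut())

print("number of folds:", len(scores))               # 50, one per left-out sample
print("leave-one-out accuracy: %.3f" % scores.mean())
```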
Whether your random number generator is reliable — these are questions that you have to ask yourself. No, no, I want to use my own random number generator that I created last week because I had so much fun. Well, don't do that. Okay, now let's look at the error as a function of model complexity. Model complexity, again, is how big your AI solution is. You train, and you see the training error come down as your model becomes more complex. You make your network bigger and bigger and bigger, and the error comes down, comes down, goes towards zero, and everybody in the lab is jumping around — yes! And then you test it, and you see there is a difference, and that difference is our business. If the difference is small, that means the model is well trained. So this is the difference between what you get in training and what you get in testing, within one training session, across different complexities. How you want to draw this curve is up to you: I can draw it for just one model that I train over one epoch, or across different models. Generally, when the model complexity goes up, the training error comes down. It could be another case: again we have the error and the model complexity, and again, in training, the more complex it becomes, the bigger the solution h, the lower the error. I start with three layers, then five layers, then 10 layers, then 20 layers, and the error goes down, down, down. Then I test it, and this time there is a bigger difference — a considerable one. So your model is not well trained. It's not the end of the world: here you may get 90% and 88%, or 88% and 85%; there you get 90% and 82%. The difference becomes bigger, more considerable. What is the end of the world? The end of the world comes when you see something like this: again model complexity and error, you train, and the world is beautiful because the training error goes towards zero. Then I test, and Jesus Christ, you get this. It's not even about a difference; this curve is going crazy, all over the place. So you make the model bigger; in training you were jumping around because you had 92% accuracy, and then you go to your unseen data and you get 52% accuracy. Wow, dropped from 92% to 52%, forty points? Buddy, you didn't learn. You did not converge to a real solution. You converged computationally, but you did not converge in terms of an optimal solution. When this happens — when this gap is really big — we say you have overfitted. First time we use the word: overfitting, the number one enemy in deep learning, because we tend to overfit most of the time. Okay, what is that supposed to mean? Everybody has seen this? No? Okay. Let's say we can put a number on model complexity and say it is the cardinality of the set P, where P is the set of parameters of the hypothesis h coming from the universe of discourse of all hypotheses. In the terminology of modern AI, these are your hyperparameters — all the parameters that make up your model. If you are doing a linear regression, you have five or six parameters. If you are doing deep learning of average size, you have 150,000 parameters. That is a lot of degrees of freedom. Up to 10 years ago it was considered impossible, intractable, to go after problems with 100,000 parameters.
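One way to see those two curves for yourself, as a rough sketch on the placeholder data from earlier: grow the model, record training and testing error, and watch the gap. The network widths chosen here are arbitrary.

```python
from sklearn.neural_network import MLPClassifier

# Increase the model complexity and compare training error to testing error.
# A small gap suggests a well-trained model; a growing gap suggests overfitting.
for width in [2, 10, 50, 200]:
    h = MLPClassifier(hidden_layer_sizes=(width,), max_iter=1000, random_state=0)
    h.fit(X_train, y_train)
    train_err = 1.0 - h.score(X_train, y_train)
    test_err = 1.0 - h.score(X_test, y_test)
    print(f"width={width:4d}  train error={train_err:.3f}  test error={test_err:.3f}")
```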
With 100,000 parameters you simply have too many degrees of freedom to fill with meaningful numbers. So, okay, go back to the roots: Occam's razor. Keep it simple. Deep learning came in through the door, and Occam's razor escaped through the window. And Occam's razor is one of the oldest pieces of wisdom we have, and it is still valid. Keep it simple: the simplest solution tends to be the right solution. If you can do it with 22 layers instead of 50, do it with 22. "Keep it simple" has a word for us in the AI literature: it's called regularization. Regularize it. Don't let this beast run away with the data — it will run away so fast; five minutes gives you 95% accuracy, but it doesn't pass the Turing test. We said that you look at the sum, over all the data you are reading, of the output your model gives minus the desired output, and you should minimize this; it should go towards zero. That's what we said. So: Alan Turing in action, Occam's razor in action, human wisdom in action. Okay, so with all this you told me about — careful cross-validation, leave-one-out, testing, validation — how should I do it in practice? Well, against the model complexity, you should give me the best result with the smallest solution. Let's go back to the 50s and 60s: if the solution was a decision tree, the size of the solution is the number of leaves and nodes in the tree. If the solution is an artificial neural network, it is the number of layers and the number of neurons in each layer. That is the size, the complexity of your model. You should minimize the error, but please don't use a billion layers to do it. Can you reach 99% accuracy with six layers? No. Okay, go to seven layers. No. Eight? No. Ten? Yes. Okay, give me ten. No, I want to do it with 250, a DenseNet. Why? Because it's much cooler, and the graphic looks so impressive, and I get to draw the tensors and everything. I thought we were engineers. So we want the lowest error, and we want the smallest solution, and we verify this with k-fold cross-validation or leave-one-out: Turing mission accomplished, in action. Now we know it has learned; it's not fake. The so-called augmented error function E' will then be your total error — whatever your error is, however you calculate it, it doesn't matter — plus a factor, let's say lambda (and this is not an eigenvalue; call it alpha, beta, anything), times the model complexity. Why am I bringing in a lambda? Because I want to be able to control it. This penalizes complex solutions with large variance. Which variance are we talking about? Training accuracy, testing accuracy; training accuracy, testing accuracy: 90%, 89%; 90%, 85%; 89%, 86%. That variance should be low; there should be agreement between training and testing if everything is working according to plan. We may need a lambda because in some applications we want to put more or less emphasis on the model complexity, and this lets us do it — yet another parameter. But please consider that things like this become the driving force for learning. This is the way we learn: we force the methodology to give me the minimum error with the minimum size. That's why we regularize it — we regulate how it works. Now, you want to play with lambda. If lambda is too large, you get simple solutions, simple models. We can force it, right? I can put a big number on lambda and say, I really like small solutions. But now you are exaggerating in the other direction. What happens if lambda is too large?
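For reference, the augmented error that was just described, written in symbols; |P(h)| is my shorthand for the model complexity, the size of the parameter set of hypothesis h:

```latex
% Augmented (regularized) error: total error plus lambda times model complexity.
% Here |P(h)| denotes the number of parameters of hypothesis h.
\[
  E'(h \mid X) \;=\; E(h \mid X) \;+\; \lambda \cdot \lvert P(h) \rvert
\]
```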
If lambda is too large, I am favoring small solutions. What does that mean? You increase the bias: you will be biased towards certain patterns and their dominance in the data, and it will collapse when you go online. Hence, we use cross-validation to optimize lambda. So it's not a bad idea to have a lambda. The significance of the regularization part and its effect on the learning can vary, but it has to be part of the validation; otherwise what leaves the lab is not the thing you actually validated. Okay. So this is one way of choosing the best configuration. Nobody says you couldn't, but nobody really uses it to choose between families of methods. Choosing whether I want to go with decision trees or random forests, or reinforcement learning, or a neural network — that's a different choice, a strategic choice. For simple problems we may run all of them and see how they do, and we cross-validate all of them; in this course we will do that, and it's expected: run two or three different AI methods and see which one is best. But what we discussed here selects the configuration of one given AI solution. You choose a neural network; the question is what type of neural network is good enough. That's why we do this. But there are other ways of doing it. Was there a question? Yes. No, the model does not know about this; the cross-validation will enforce it — the lambda, yes, the lambda. We can play with it, we can try different values, and the cross-validation will bring us towards the right one. Any model, any AI model? We can use other AI models to adjust this, yes; we could use any optimization technique to adjust it. But we usually don't, because it's embedded inside the cross-validation. Still, like any other parameter, it can be optimized in many different ways. So what about another model selection approach? Basically, k-fold cross-validation and leave-one-out are methodologies for us to select a model: is the model good enough, has it learned, is it intelligent at all? The other method is the Bayesian approach. The Bayesian approach is used if we have prior knowledge; if you don't have prior knowledge, it cannot be used. And of course we use the Bayes rule, which we will come back to, but just for the sake of today: you look at the probability that the model is the correct model given the data that you have. That's another approach to model selection: what is the probability that the model I am using — the network, the decision tree — is the best model given this data? Using the Bayes rule we can calculate that: the probability of the model given the data equals the probability of the data given the model, times the probability of the model, divided by the probability of the data. So when we say you need prior knowledge: do you have the probability of the data, the frequency of the data? Do you have the probability of the model — how many times that model was a good model, and in how many of the experiments? Do you know something in advance, or do I have to run a crazy amount of experiments to find that out? The Bayes rule is a very, very powerful rule, but it cannot be applied to every possible problem, because we do not have the prior knowledge for every application. What else is there? There are some other approaches, other methods. There are methods called structural risk minimization, and there is a really old one called minimum description length.
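Coming back for a moment to tuning lambda with cross-validation, a sketch of how that could be done on the earlier placeholder split; here the L2 penalty alpha of the network plays the role of the lambda from the board, and the candidate values are arbitrary:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Cross-validate over several regularization strengths: too large biases the
# model towards overly simple solutions, too small lets it run away with the data.
search = GridSearchCV(
    estimator=MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0),
    param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

print("selected regularization strength:", search.best_params_["alpha"])
print("test accuracy of the selected model:", search.score(X_test, y_test))
```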
To come back to those other approaches: minimum description length comes from information theory, is closely related to and inspired by Occam's razor, and can also be used for our purposes. But with all of them, we want to hit the good solution, and most of the time we drift towards bad solutions: either a solution that is too small or one that is too big. With a small solution — and when I say small, I mean too small — you get biased. With a big solution, of course, you get very high variance, a lot of variance. So here you may get 90%, 90%, 90%, which is very stable because you are biased, and there you get 85%, 92%, 82%, 95%, 72% — a lot of going up and down. We want to be in between: a good solution is not biased and does not have high variance. And the only way we can make sure of that is cross-validation with unseen data. So it is very, very crucial that we do not make a mistake when we validate. I train, validate, train, validate, train, validate, with some regularization, and then I choose the best one and test it. You test once for every model; you cannot run the test again — you just did the test. But training and validation we can do many times: train, validate, train, validate, test. Test your best model, the best foot you can put forward with respect to everything we talked about: not too small, not too big. The balance is: give me something that is big enough to solve the problem, but not so small that it is just biased towards certain patterns in the data. Okay, so next week we will start with clustering, and we will talk about two methods: a very old one, k-means, and self-organizing maps. At the latest by Tuesday next week we should be able to use basically everything we have talked about so far, because suddenly we will have a method — two methods, in fact: one is a simple iterative algorithm, the other is the first type of neural network, though we don't call it a neural network at this moment because we don't know neural networks yet; it is a clustering technique that has some structure. And today we have the tutorial, where we will cover PCA. I will also upload some reading material and some videos for everything we have covered this week.