Okay, good evening everybody. So we want to continue with the introduction and preparation and some reflections. We started with some general ideas of definition, how intelligence is generally understood, and the relationship between reasoning and thought, action and rationality. And we started with one of the most important topics that we have in AI, which is the Turing test. We went quite fast over it, so I want to slow down a little bit, go back, and ask some questions. So: can machines think? Is that the question that Turing was asking? About the paper that I uploaded on Learn: if you're sticking around in AI for more than a year, you should read that paper. If this is just one course that you want to have on your transcript, that's fine, don't read it. But if you want to be in the AI community for more than one year, read it. You cannot be in the AI community and not have read "Computing Machinery and Intelligence" by Alan Turing. That's the point of departure for everything, basically. So is that the question that Alan Turing asked? It doesn't seem that way. It doesn't seem that way. Alan Turing assumed we can build intelligent machines. Why am I saying that? Because he starts thinking about: okay, if machines are intelligent, how can we measure it? If you assumed it were not possible to make machines intelligent, you would not come up with a hypothetical test, which actually is not hypothetical anymore; we do the Turing test, and you can do it in many different ways. So, the genius that he was, he assumed the rest of us are at his level and we will make it, we will somehow create intelligent machines. And then the most immediate question in his mind was: okay, how can we measure intelligence? Measuring intelligence is what he is proposing.
So the Turing test, and some colleagues may disagree with me here, is not a philosophical test; it's an engineering test. It assumes that yes, you can build intelligent machines, and asks: how intelligent is it? That is fundamentally important for us. So Turing anticipated the fields of AI. Again, read that paper; it's a good read for a weekend. Searching, I especially like that one, because there is virtually no other task that is more difficult than searching. And don't just think about browsers, where we type something and the search engine finds it for you. No, no, think of the implicit searches. Whenever you see somebody and you remember the face, that was a search. The challenge is that the visual information has been stored in a very complicated way in the visual cortex, and the search mechanisms are highly sophisticated, the product of millions of years of evolution. So search is crucial. Just imagine what the internet would be without search. There you go. We didn't invent search; search was in front of us. We just took it and implemented it in different ways. Of course, reasoning. I start with these two because this is where we are lacking. This is where we still have problems and where we still fail the Turing test. We still cannot search intelligently, and we cannot reason based on knowledge that we have acquired. And there is no network that can do this, none of the networks that we all like and are all using. Knowledge representation: you cannot have an intelligent machine that is not capable of representing acquired knowledge in a compact, lossless way. How can you represent knowledge? A big topic in deep networks is abstraction and representation. You put an image into a deep network, and at the last layers you can get some numbers out that represent the image. A thousand by a thousand, a million numbers, compressed into a thousand. Wow. That kind of compression requires intelligence.
This is why we start with PCA, principal component analysis, which hopefully we get to today as part of the short summary of the history of AI. Dimensionality reduction is a sort of knowledge representation in a compact way, because nobody, including Homo sapiens, has an unlimited amount of storage. So you have to take the information, compactify it, and then store it such that you can access it and reason with it. Then of course natural language processing, big, big topic. Again: search on the internet, not possible without it; document classification and many other things, not possible without natural language processing. And most communication and reasoning for us as humans happens through natural language. So you cannot talk about AI and not have some way of processing natural language. Computer vision, very intuitive for the masses. People understand AI when it is applied to images. Everybody understands: look, you click here, it searches, it finds, and it tells you what it is, so everybody gets it. Over 90% of the information that we acquire as humans is visual. Only 10% is hearing and touch and the other senses. 90% is visual. So computer vision is a natural vehicle for AI. And look again at the recent success of AI: most of it is in computer vision, and then comes NLP. And of course, learning. Learning in general: how can we learn? Give me a million numbers and I learn from them. Okay, if I give you two, can you learn? You ask for a million images of cats and dogs. How about I give you one? Can you learn from one image? Humans can. So we have the concept of one-shot learning in machine learning. Can you learn from one example? The total opposite of deep learning. Deep learning lives on data. You need data, you need a lot of data. But sometimes we don't have data. And you show one image to somebody, and he or she learns from that one example.
What kind of knowledge representation, search, reasoning, vision and learning must you have to learn from one example? Wow, we're not there yet. All of that is in that seminal paper of Alan Turing, if you read between the lines. "Computing Machinery and Intelligence", read it. Okay, then fast forward 30 years after Alan Turing, and somebody else came along: not a computer scientist, not an engineer, not involved in the wartime encryption and decryption work, but a philosopher. He said: yeah, okay, so these AI things that people talk about. And AI was actually in one of its winters at the time. Nothing was happening, and people were completely disappointed in AI. And then, as if we needed another attack (but this was not an attack, it was a clarification that we needed badly), came the Chinese room by John Searle, a philosopher. Rooted in his PhD work and whatever he did afterwards. So, the Chinese room. Even though what he says is not really about Chinese, he takes it as an example because he wanted a language that is very different from English, and naturally you come to Chinese. Strictly speaking there is no single language called Chinese, but okay, we can forgive the philosopher for that. So you have a room, he says. I think deep learning may pass the Turing test for several minutes, but it will fail in the Chinese room. The Chinese room starts with this: you have somebody in there, a guy or a gal, doesn't matter, gender neutral. This person is not a native Chinese speaker. He or she does not know anything about Chinese, has not even heard the word Mandarin, doesn't even know where China is. Pick anybody; I don't want to indulge any prejudices here, so anybody, anybody who is not Chinese and doesn't know anything about Chinese. In this room you also have some books. You have one book, which is a dictionary.
So you have a Chinese-English dictionary, which contains every word in both directions, every single word. And then you have another book, which contains millions of rules. This non-Chinese person has been trained to use the dictionary, follow the rules one by one, and then this person can basically translate any Chinese text into English. So from outside, people come and put in Chinese text, and out comes English text, beautifully translated. Not the nonsense we have right now, really good stuff. Fantastic, flawless. So what, you're playing with this? It's a Gedankenexperiment, a thought experiment, so what the hell. So what is that? He creates this hypothetical Chinese room and then asks the question: can this person understand Chinese? The first time I heard this, everything I knew about AI collapsed in my mind. No, of course the person doesn't understand Chinese, we know that. But what goes in is Chinese, and what comes out is a flawless translation in English. How can that be? This was shortly after some limited success of the so-called expert systems in AI, which have been abandoned by now. Maybe we have rescued some of it in the form of decision trees and random forests, maybe some of it in fuzzy logic. But we had some limited success, especially in the medical field, with these expert systems. And there were big computers with 80 million hard-coded rules that would play chess against the world champion and beat him. And everybody would say, oh my God, computers have become so smart. So it was around that time that he came up with this question, the Chinese room: can the person understand Chinese? No. Now, so far he could have done just that, and nobody would know who John Searle is, nobody. The next question put him on the map. Oh, this guy has understood what it is about, although he's a philosopher. Then he asked the second question: can the room understand Chinese? Oh boy, oh boy.
Now 60 years of AI is completely garbage with this question. Because yes, English comes out of the room, but we know that the room is not intelligent. So with a dictionary, a rule book, and a non-Chinese person operating on them, why does flawless English come out? Well, you are forgetting the visual cortex and the frontal brain of this person. There is intelligence in the room, but it has nothing to do with Chinese and English. It's a manual operation. You are hard-coding intelligence. You can go and beat the hell out of Kasparov if you use technology like this, or, for more complicated games like Go, a more advanced part of AI. But is that intelligence? Well, you want one example to see what intelligence is? Because this makes me say: oh my God, maybe I should change my research field, because with this type of question, AI will never get there. Yes, he rules that out. The person doesn't learn anything; he just follows the rules in the book, reads the word-by-word translation in the dictionary, and follows along like a robot, basically. You could replace him with a robot. The problem is, if you put a robot here, well, who made the robot? There's a lot of intelligence in making a robot. So he puts a person in there, but he excludes the possibility that the intelligence of this person plays any role. Just stupidly follow the rules, use the dictionary, do it like a robot. And here we are using "robot" in a derogatory sense: robot means stupid, doesn't know what it's doing. Okay, so let's go to an example. Let's go to tic-tac-toe. We want to conquer the planet with AI, but we should at least be able to play tic-tac-toe, shouldn't we? Intelligently, intelligently. Okay, so I sit down and come up with version one of a program to play tic-tac-toe. We want to do AI, we want to impress people. So we come up with a lookup table.
Which means: on the board that you have, three by three, or whatever it is, six by six, nine by nine, you come up with a table. We say: if this and this are set, set this. This is hard coding, this is looking things up. You put the rules in a table and then you just look things up. That was the IBM computer that beat Kasparov. You look things up, very fast. You have a million rules. No human being can go through a million rules; computers can. So speed stands in for intelligence. Which means the answer for each situation is hard-coded. This was AI in many situations in the 60s and 70s and maybe a little into the 80s. We would hard-code stuff. We would extract some rules through interviews, manually from the data, and then we would put them in a lookup table. The table could be explicit or implicit, it doesn't matter, it is a lookup table. You look things up. You say: now I'm in this situation, I'm playing chess, this piece is here, the pawn is here, the rook is here, so I have to do this. You look things up. Is this intelligent? Whether you hard-code two pages or a million pages? You want credit for having a big lookup table? Fine, get credit for that. You are a fantastic software engineer, but this is not AI. Why not? Why? Okay, so let's see other solutions. Version two. If we cannot distinguish between a conventional solution and an AI solution for a simple problem like tic-tac-toe, how can I do that for Da Vinci, the surgical robot? How can I do that for the stock exchange? How can I do that for cancer survivability prediction? If I cannot do it for this, I have no chance; I have to be able to do it for a simple case first. So now we break it into two parts. A: attempt to place two marks in a row. So now I'm trying to develop a strategy. Okay, so I don't want to hard-code it.
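Before moving on, here is what that version-1 lookup table might look like as code. This is a minimal sketch; the board encoding and the table entries are invented for illustration, not taken from any real system:

```python
# Version 1: a hard-coded lookup table for tic-tac-toe.
# The board is a 9-character string, 'X', 'O' or '-' per cell.
# Every entry maps one exact board situation to one hard-coded move:
# no strategy, no evaluation, just "if this situation, play here".
LOOKUP = {
    "---------": 4,   # empty board -> take the centre
    "----X----": 0,   # opponent took the centre -> take a corner
    "X---O----": 8,   # ... and so on, one entry per anticipated position
}

def lookup_move(board: str) -> int:
    """Return the hard-coded move for this exact position."""
    return LOOKUP[board]  # fails on any position nobody anticipated

print(lookup_move("---------"))  # -> 4
```

A complete table for tic-tac-toe is feasible; for chess it explodes, which is exactly why this approach is fast software engineering but not AI.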
I want to play with a strategy, a strategy that is applied consistently, such that nobody can accuse me of just looking at the situation of the game and looking up what I should do. I don't look it up; I have a strategy. So, A: attempt to place two of my marks in a row. That's the prerequisite to win: if I have two in a row, I can make the third one and win. And B: if the opponent has two marks in a row, mark the third spot. So I try to place my marks in two consecutive spots, and at the same time my other strategy is to not let my opponent win: if he has two, I put in the third one to block him. Is this a strategy, or is it just a rule that you apply? So now, is that intelligent? Well, not sure. It is dynamic programming. But is dynamic programming intelligence? Well, if you're talking about reinforcement learning, then yes. If you're talking about the Bellman equations themselves, no. So, version three: represent the state of the game. You just mentioned a phrase that freaks me out. What, represent the state of the game? When you start with words like that, it's clear where you're going. You are trying to come up with the most generic approach to tic-tac-toe. You don't care which row is marked, or who marked it. You want to understand the game. You want to understand the room. Version two is not understanding the room; that is just, okay, let's do something. So: what is the state of the game? What is the current board position? And you can apply this to basically anything. And what are the next legal, eligible, permissible moves? Again, I don't want to take any explicit actions. I want to understand the game. What's going on? What is the situation of the board, and what are the next possible actions? You're keeping it general. What do you think it means when a network converges and we check the generalization of the network?
Because intelligence means you have a general concept. Now, okay, representing the state of the game, that's half of our game of playing AI. The other half is: use an evaluation function. Another phrase that freaks me out. Use an evaluation function. So don't tell me one is set, two are set, which row is set. Evaluate the game as such. Define a function that evaluates the game. Tell me the state of the game, then evaluate the game. Very different; this is not hard coding. Then rate the next move. Now you are bringing in dynamic behavior as much as you can. So I evaluate the game, which means: how likely is it that I win? And rating the next move is this: if I make this move, and I have not said which move it is, it could be any move, how likely is it that I win? So the evaluation function acts upon the state of the game for the ultimate purpose of winning the game, of getting the job done. We don't go into details when it is an AI solution. There is no detail when there is an AI solution; everything is generic. It cannot be explicit; everything has to be implicit. And we want to push the envelope: look ahead if possible. So don't just be street smart; be street smart and academically smart. What is that supposed to mean? Don't just tell me the likelihood of winning; also go in and say: if I do this, he will do this, then I will do this, then he will do this. How many steps can you predict ahead of time? How many moves ahead can you predict, so that if I go along this path, the game will end like this? When we get to reinforcement learning, we will talk about this sort of thing. This is not for a neural network; we don't have that dynamic in a trained neural network. So: look ahead, if possible, to estimate the opponent's moves. If you start here, with the lookup table, you have a static approach.
And when I get here, I have a dynamic approach. So this one is static and very specific; this one is dynamic and generic. This is explicit; this is implicit. This is why this one is intelligent, and why this one is stupid from an AI perspective. It may still be okay, because from an engineering perspective the design of solutions like this is actually ideal: very inexpensive. And we do that a lot. The way bandwidth allocation happens on the internet and in cellular communication is all static; most of it is based on lookup tables. Allocating bandwidth from this channel to that channel is a lot easier to do in a static, stupid way. It can be an efficient engineering solution. So when I say stupid, I'm not dismissing stupid. As an engineer, I may build a lot of stupid products that are very useful. We never said stupid cannot be useful. I'm not using it as a derogatory term; I use it as the opposite of intelligent. We said we have to be intelligent, but sometimes it's nice to be stupid and just enjoy life. So this is the difference. Now the question is this, and perhaps we cannot answer it until the end of the term. Would reinforcement learning pass the Turing test? If yes, would reinforcement learning pass the Chinese room question? Would a neural network pass the Turing test? Would a deep neural network understand the room? Is fitting half a million parameters with two million training samples understanding the room? Unpleasant questions. Somebody may say this is an anti-AI lecture. It is not. We just want to be on safe ground. A lot of intelligent people have worked before us and created what we have today. And when we want to move forward, we don't want to fall in love with what we have. We want to stay objective. We want to make sure that what we do is really dynamic, generic, implicit; that it can generalize. This concept can generalize: with limited change, you can apply it to any game.
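The version-three ingredients (state representation, legal-move generator, evaluation function, one step of look-ahead) can be sketched for tic-tac-toe as follows. The scoring weights in the evaluation function are my own illustration, not a prescribed heuristic:

```python
# Version 3: represent the state, enumerate legal moves, and
# rate each move with an evaluation function instead of a table.

def legal_moves(board):
    """The next permissible moves: all empty cells."""
    return [i for i, c in enumerate(board) if c == "-"]

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def evaluate(board, me="X"):
    """Toy evaluation of a state: how promising is it for 'me'?
    +100 for a completed line, +10 per open two-in-a-row,
    symmetric penalties for the opponent's lines and threats."""
    opp = "O" if me == "X" else "X"
    score = 0
    for a, b, c in LINES:
        line = board[a] + board[b] + board[c]
        if line.count(me) == 3:
            score += 100
        elif line.count(me) == 2 and line.count("-") == 1:
            score += 10
        if line.count(opp) == 3:
            score -= 100
        elif line.count(opp) == 2 and line.count("-") == 1:
            score -= 10
    return score

def best_move(board, me="X"):
    """One-ply look-ahead: try each legal move, keep the best-rated one."""
    def after(move):
        return evaluate(board[:move] + me + board[move+1:], me)
    return max(legal_moves(board), key=after)

# X has two in a row (cells 0 and 1); the evaluation drives X to complete it.
print(best_move("XX--O-O--", me="X"))  # -> 2
```

Nothing here enumerates situations; the same three functions, with a different `LINES` and `evaluate`, would carry over to another board game, which is the generality the lecture is after.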
Of course, the state has to be defined explicitly for every game, and the evaluation function needs design effort for different games. Yes, but the rest, no. Okay, we said AI is function approximation. Which means what? Which means: in a high-dimensional space of X and Y, and I cannot draw a 4,000-dimensional hypercube here, so I reduce it to just one dimension for the input, and the output the same. If there is a non-linear function like this which is unknown, f of X is unknown, nobody knows f of X. The oil price goes up, there is a war in the Middle East, there is a not very smart person as president of a certain country: how probable is it that the stock market collapses? These features that I just listed, take 4,000 of them, 4,000 attributes that affect the market. Or: I am 50 years old, I am a smoker, I eat a lot of animal products, I have no physical activity, I have a lot of stress. How likely is it that I get lung cancer? There is no formula, no x squared plus sine of x1 divided by the square root of x2, no such function that I can plug my numbers into to get the prediction, how likely it is that the market collapses or that I get lung cancer. We don't know this function. Now, what does AI do? AI approximates this function for us. And if the red squares are the only regions that you have approximated and the rest not, that's not a good AI solution. You have not converged. You have overfitted or underfitted. You cannot generalize. You don't understand the room. A good AI technique, with good generalization, will approximate everything at the resolution that we need. It will patch the function all along without knowing the blue curve. Nobody knows the blue curve, but you approximate it. How much intelligence is necessary to approximate something that you don't know? Nobody knows it.
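A tiny illustration of this idea, assuming numpy is available: we pretend sin(x) is the unknown blue curve, hand the learner only noisy samples, and fit an approximator that never sees the formula. Polynomial fitting here is just a stand-in for any learning technique:

```python
import numpy as np

rng = np.random.default_rng(0)

# The "unknown" function: the learner only ever sees (x, y) samples.
def hidden_f(x):
    return np.sin(x)

# Data: noisy samples of the unknown function.
x_train = rng.uniform(0, 2 * np.pi, 200)
y_train = hidden_f(x_train) + rng.normal(0, 0.05, 200)

# Approximate: fit a degree-7 polynomial to the samples alone.
coeffs = np.polyfit(x_train, y_train, deg=7)
approx = np.poly1d(coeffs)

# The approximation tracks the curve even at points it never saw.
x_test = np.linspace(0.2, 2 * np.pi - 0.2, 50)
err = np.max(np.abs(approx(x_test) - hidden_f(x_test)))
print(f"max error on unseen points: {err:.3f}")
```

The polynomial "patches" the blue curve everywhere in the sampled range without ever knowing it was a sine; that is the whole game, scaled up to 4,000 dimensions.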
Is that impressive? Of course it is. Is that exciting? Yes, one hundred percent. So function approximation is not a degradation of AI. It's the most difficult thing you can do, and AI can do it. You approximate the relationship between your inputs and your outputs, and you have 4,000 inputs and one output. Highly non-linear, non-stationary, stochastic, quasi-chaotic, noise everywhere. In order to do that, and impress people with those red patches that approximate the unknown function, of course we need data. Of course we need data. You don't give me an equation, so give me data. If you give me data, I can approximate it. We need data for any approximation. Historical example: linear regression. Many AI courses start with linear regression. I don't, because everybody knows it; if not, it's half a page of any textbook. Just look up what linear regression means: alpha x plus beta, feed the data in, estimate alpha and beta, you have linear regression. Boom, done. What do you think neural networks do? But a neural network does alpha one times x one plus alpha two times x two plus alpha three times x three, up to alpha 4,000 times x 4,000, and non-linearly on top. I cannot do that with linear regression; that is non-linear. With linear regression I can fit the data to a line, maybe a smooth curve, but nothing that crazy. Nothing that crazy. So AI means we operate intelligently on data to extract the relationship between inputs and outputs. When you formulate it that way, which is not a simplification, it's getting to the core problem, so many students tell me: you just take the magic out of AI. Well, get real, this is what it is. If you think this takes the magic out of AI, you don't know how difficult it is. Maybe that's the problem. Saying AI is function approximation is not degradation; it is giving the highest prize in the universe to AI, saying: you can approximate this. Just respect.
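That half-page of linear regression, in code: feed data generated by alpha x plus beta into a least-squares fit and recover alpha and beta. The true values 2.0 and 0.5 are made up for the demonstration; the fit only sees the samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data generated from y = 2.0 * x + 0.5 plus noise;
# the learner sees only the (x, y) pairs, not these numbers.
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 100)

# Least squares for alpha (slope) and beta (intercept):
# stack [x, 1] as design matrix and solve.
A = np.column_stack([x, np.ones_like(x)])
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")  # close to 2.0 and 0.5
```

A neural network generalizes exactly this: thousands of inputs instead of one, and non-linearities stacked on top, but still weights fitted to data.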
So we need training data for learning and we need testing data for testing. Okay, that's new. What is that supposed to mean? Well, you have to give me some numbers so that I can figure out the patterns. You give me a table of one million rows, and every row has many, many columns: columns one to N are the X's and the last column is the Y. I will do my best to learn these patches, to place these patches in a hyper-dimensional space that nobody can visualize and nobody can see. But then, how do we know that I did my job properly? We need testing data. Since we don't know the function, I cannot test you with a function, but I can still test you with numbers. You place these patches, and then I say: okay, for this Xi you should give me Yi. Are you giving me Yi, or are you giving me different numbers? So I can test you, although I still don't know the function. We need data for training, to figure out the position of the patches, and we need data for testing, because nobody knows whether the solution is really okay or not. The trouble with the big solutions at the moment is this, and it has always been the problem, it's not new. If you came up with a decision tree of half a million nodes, it was clear that 90% of it had to be pruned. This is too much; you can do it with 5,000 nodes, you don't need half a million. Now we have networks with 200 layers. So aren't we just inflating the problem? Nobody knows. And why does nobody know? Because we don't have enough testing data to say: aha, I got you. Be careful. Maybe we don't understand the room. Maybe we still don't understand the room. Maybe, maybe, I'm playing the devil's advocate. Maybe deep networks are just big, gigantic lookup tables. So how can we know? Well, it's an implicit table, a highly non-linear table, a very sophisticated table, but a table nonetheless.
Like hashing: you look things up, don't you? How do I know it's not a table? Well, if I had access to all representative data for that application, maybe we could tell. The reason we do adversarial attacks is exactly that: we don't have enough data to say, oh my God, no, you have not learned, you have memorized. But we can't, so one of the ways we probe it is the adversarial attack. Why is it that you say "cat" when I give you a cat and "dog" when I give you a dog, and then I play a little bit with the image of the cat, give it to you again, and you say "guacamole"? That's an actual case; it was reported. A network mistook a cat for guacamole. So please don't eat the poor animal. Why does that happen? Like with any hash function: the table positions are close together, and you can easily slide into a neighboring position. That can happen with any lookup table. So are our deep networks gigantic, non-linear, implicit, sophisticated lookup tables? Not all of AI, but the part that is very successful, the part that is the reason most of us are here today, is data-driven. So what problems could arise? Well, you may not have enough data. Not enough data: you hear that again and again. My first reaction as an engineer is this: you don't have enough data? Don't use AI. No, no, no, no: I have to do it, I have a co-op, I have an interview, I just want to put a personal project in my resume. Okay. But at least don't make yourself ridiculous and train a deep network with 1,500 items. So what can we do here? There are many tricks. We can do augmentation, at the input level or in the feature space. There are many tricks to artificially increase the amount of data that you have. I have an image, so I can flip it, rotate it, change the colors. From one image, I can create 50 images. It's artificial.
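A minimal sketch of that input-level augmentation on an image stored as a numpy array. The transforms shown (flips, 90-degree rotations, a crude brightness change) are the standard cheap ones; the tiny 4x4 "image" stands in for a real photo:

```python
import numpy as np

def augment(img):
    """From one image, create several artificial variants:
    flips, 90-degree rotations, and a simple brightness change."""
    return [
        np.fliplr(img),             # mirror left-right
        np.flipud(img),             # mirror up-down
        np.rot90(img, k=1),         # rotate 90 degrees
        np.rot90(img, k=3),         # rotate 270 degrees
        np.clip(img * 1.2, 0, 255), # brighten (a crude colour tweak)
    ]

# A fake 4x4 grayscale "image" stands in for a real photo.
img = np.arange(16, dtype=float).reshape(4, 4)
augmented = augment(img)
print(len(augmented) + 1, "images from one")  # -> 6 images from one
```

Feature-space augmentation works the same way in spirit, except the perturbations are applied to the extracted feature vectors rather than to the pixels.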
Generally, I wouldn't do that for sensitive applications, because if you don't have data, that tells me the nature of that application is not made for AI. Then don't do AI, or use one of the AI techniques that doesn't need much data. The other problem: you have too much data. Well, what is too much data? You have too many features: each data point is described by several thousand attributes and characteristics, and in order to feed that beast with 200 layers, you need a huge architecture to be able to swallow it. So what do we do then? If you have too much information, too much data, the standard trick we use is dimensionality reduction. And we don't do it only because we have too much data; sometimes we do it because not every data point is valuable. I may have garbage in my data. I may have correlated information. I may have noise. Don't just feed it to the poor network and say: you figure it out. How many GPU hours do we have to waste for it to figure out that of your 4,000 features, 2,500 are redundant? You want to classify cars: horsepower, fantastic; weight of the car, fantastic; brand, fantastic; the color of the car, useless. Which features are relevant and will help us understand the data, the relationship between X and Y, and which are not, is not that easy to tell. In the real world, we don't see it. So you need techniques to figure it out. Dimensionality reduction is one of them, and we will actually start with that. Okay. So again, we stick with the simplified case of X and Y. Let's say we have some data: we have some X's, and for each X we have some Y. When I come here and look this up, this is X and this is Y; this is one data item, one dot that we plot. And I display everything I have.
Let's say you give me the Excel sheet with thousands or millions of rows of data, and I read them and display them, visualize them somehow. Now if I look at this, the first question is: is this data ready to be learned? One of the things that we don't understand is the filtering and processing that the human brain applies to data. We have basically zero understanding of that. What image is mapped onto the retina at the back of our eyes? We have a full understanding of the eye and the way it works; that's the reason we have our fantastic cameras. We understand every part of it, how the image is mapped onto the retina. But the moment that photonic energy is converted into some sort of biochemical signal by the rods and cones, the photosensors, and goes through the optic nerve, we are lost. We don't know what's happening. The optic nerve takes some encoded information to the visual cortex, and the dimensionality reduction has already happened. So what the visual cortex gets, what the biological neural network gets, is not the raw data. Right now we are completely inclined to impress each other with "I process the raw data". Is that smart? If I can make the life of my network easier, why shouldn't I? If I can do it with 20 GPU hours, why should I do it with 500? We should preprocess signals as much as we can, and one of the ways is dimensionality reduction. Computation is expensive. You will make your own experience in whatever sector of industry you go to, or academia, or government: you do not have unlimited resources. You may have a fantastic idea, that I have a network like this and I will train it. But then you calculate. Somebody was telling me about medical images, and we made a rough calculation: what he was proposing would take 54 years with sophisticated computing power. So you are not serious.
And I like to think that we as engineers and computer scientists are serious people. We want to put a fridge in your home, an ATM that you get money from, an airplane in which you fly and watch movies over the cloud. We want to do practical stuff, not just understand the room. So: I take a look at this data and find the mean of the population. What is the average of the data? Then I find a new coordinate system that goes through this average; say this axis is X prime and this is Y prime. If I do that, is there a transformation that lets me understand and work with the data using just one coordinate, not two, not ten? If I give you 1,000 features that describe the stock market, or the terrain that a robot is trying to navigate, or the factors that lead to hyperplasia in the prostate, if I give you 1,000 numbers, can you make it 20 and learn with 20? Why? Again, infrastructure is expensive. We have 80 to 100 billion neurons in our head. It's a lot. But every neuron, on average, is connected to only about 10,000 other neurons, so the connectivity is rather local. Whatever happens, if you look at functional MRI, things that happen are rather local. The brain doesn't like to engage everybody when a small part of the brain can do the job. Being efficient and effective is a manifestation of intelligence. So don't waste my time and energy and computational power. Can we do this? Well, to do this you have to understand that there is such a thing as correlated vectors. Of course, if this is A and this is B pointing the same way, they are totally correlated. But if this is A and this is B at a small angle, they are still correlated to a certain extent. So what is not correlated? What is not correlated from a vector algebra perspective? I can't hear you. If they are orthogonal. Not correlated means they are orthogonal, in multiple dimensions.
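Correlation can be checked directly in code. The car features below are invented numbers for illustration; the point is the correlation matrix, which exposes the redundant feature and the irrelevant one:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fake car data: weight drives max speed (correlated); color is random.
weight = rng.uniform(900, 2500, 500)                   # kg
max_speed = 260 - 0.05 * weight + rng.normal(0, 5, 500)  # km/h, tied to weight
color = rng.uniform(0, 1, 500)                         # arbitrary colour code

features = np.column_stack([weight, max_speed, color])
corr = np.corrcoef(features, rowvar=False)

print(f"weight vs max speed: {corr[0, 1]:.2f}")  # strongly (negatively) correlated
print(f"weight vs color:     {corr[0, 2]:.2f}")  # near zero: "orthogonal"
```

A correlation near plus or minus one says one of the two columns is wasting GPU hours; a correlation near zero says the features carry independent information.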
Why is that important suddenly? We leave the Chinese room and talk about algebra. You cannot do AI without linear algebra, vector algebra, without matrices. You can't. Unfortunately, everybody cooks with water; there is no other liquid. If you find one, trust me, it's not on this planet, because on this planet everybody is cooking with water. So why is that important? In order to understand the data, and this is 1901, the beginning of machine learning basically, if you want to work with this data, you have to understand the relationships between the characteristics of the data. Horsepower, weight, brand, color of the car. Which of them are useless? Which of them are correlated? Which of them are not correlated? We have to figure that out, and figuring that out is not easy. I don't have one vector; my vectors will be high-dimensional and I have millions of them. So it's not easy; I have to be smart about it. So what is desired? What do you mean, what is desired? AI is function approximation. AI needs data. Training needs data. Testing needs data. Data is everything. Oh, okay, I was about to forget it: data can have two problems. Either you don't have enough of it or you have too much of it, almost all of the time. I don't know any colleague who says, yesterday I got a project and I had the perfect amount of data. There is no such thing. So what is desired with respect to having good data for an AI project? Well, we want to have no correlation. Is the weight of the car correlated to the maximum speed of the car? Is it? If it is, then one of them is redundant. Don't waste my GPUs. So we don't like correlation; if things are correlated, I just need one of them, I don't need all of them. What else? High variance. I need data that changes. If something doesn't change, say every car has the same color... and let me stick with the car example, not the stock market and cancer and whatever. 
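Both desiderata, no correlation and non-zero variance, are easy to check numerically. A hedged sketch with invented car numbers: if weight and top speed are strongly correlated, one column is redundant; a column that never changes carries no information at all.

```python
import numpy as np

# Invented data for five cars (not real measurements).
weight = np.array([1200., 1500., 1800., 2100., 2400.])   # kg
top_speed = np.array([210., 195., 180., 170., 160.])     # km/h
color_code = np.array([3., 3., 3., 3., 3.])              # never changes

# Correlation coefficient between weight and top speed.
r = np.corrcoef(weight, top_speed)[0, 1]
print(r)                 # strongly negative: heavier car, lower top speed

# A feature with zero variance is useless as well:
print(color_code.var())  # 0.0, no discriminating power
```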
So if every row is a car and one of the columns virtually doesn't change, well, is that a good feature? It's not giving me any value. But we also want high variance when I look at the entire data at once, in high dimensions. In what part, in what corner of the hyper-dimensional cube, when I put all features together, do I get the maximum variance? What do you mean? Well, you do this: you project every point onto every axis that you have. So I project all these points onto the x prime axis and measure these distances. What is the variance of those distances? And now go around and project onto the y prime axis. Now project this one. You see, it's good that we have computers; manual stuff takes a lot of time. So this, simplified of course, is the variance for y prime, and this is the variance for x prime. Which axis is giving me more variance? X prime. So can I get rid of y prime? But you are working with x and y; there is no x prime and y prime. Well, let's transform the coordinate system. Okay, Mr. Pearson, how should I do that? Well, we need to do some preparation for that. This is what I said: we cannot jump into AI before we have some sense of the challenges of dealing with the data, and of how we evaluate our experimental design. Then we can say, okay, now show me some methods, now I can apply them. Okay, I prefer to just clean and erase the board and not play with sliding it up and down; that's just me being old-fashioned. So x prime passes through the mean of the population, as the statistician would tell us. So this is a population, and you have a mean of the population or of a sample. If it is the entire data, then it's the population, but usually we get a sample of the population. We never get the entire population, not in AI. So x prime passes through the mean and delivers maximum variance: max variance when samples are projected onto it. So we use the mean. 
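The projection experiment on the board can be sketched numerically: project the centered points onto a candidate axis (a unit vector) and measure the variance of the projections; the axis with more variance wins. The cloud and the angle below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# An elongated 2D cloud (made-up data), then rotated by 30 degrees.
pts = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
pts = pts @ R.T
pts -= pts.mean(axis=0)        # the axes must pass through the mean

def projected_variance(points, axis):
    axis = axis / np.linalg.norm(axis)   # unit vector
    return (points @ axis).var()         # variance of scalar projections

x_prime = np.array([np.cos(theta), np.sin(theta)])   # along the cloud
y_prime = np.array([-np.sin(theta), np.cos(theta)])  # orthogonal to it

print(projected_variance(pts, x_prime))  # large
print(projected_variance(pts, y_prime))  # small, so y prime can go
```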
It has a reason that the x prime, the new axis, the new coordinate system, is chosen this way. So we have to find a way to transform the data, to bring it from x, y to x prime and y prime. And then, in there, we decide which axes are useless. And of course it's not about x and y; it's about x1 and x2 and x3 and x4 up to xn. So n is usually 1,000, 2,000, 4,000, 5,000. So you have a 5,000-dimensional coordinate system. Can I get by without most of those axes? Well, that's a heavy-duty decision. If you do something wrong, you may kill the discriminating factor that enables you to recognize stuff. So you need to be smart about it; the algorithm has to be smart about it. And we do that because you get maximum variance when samples are projected onto the axis that you're processing. We process x1, then we go to x2, then to x3, then to x4, and so on. Again, we cannot visualize that. You can easily understand: we cannot visualize anything beyond the third dimension, which is a problem. So somebody has to come up with a solution: how do we visualize something with 5,000 dimensions? So this process is repeated until we find n such axes, x1, x2, up to xn. And if x belongs to a D-dimensional space, then n has to be much smaller than D. Then you are smart; a very simple condition for being smart. You give me 5,000 attributes, I give you back 20. Which means what? Now go to that gigantic Excel table that you got from your company, or downloaded from Kaggle, or got from UCI or any other platform, where you have 5,000 columns, and get rid of 4,980 of those columns. Wow, that makes things a lot easier. Then you don't need to beg others: can you give me two hours of GPU? Then you may need just two minutes of GPU. So principal component analysis, PCA for short, is the process, the algorithm, to find orthogonal axes that diagonalize (I'm using a word that may confuse us; we get to that, don't worry) the covariance matrix. Some jibber jabber, for now. 
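Under the description above (orthogonal axes, maximum variance first, n much smaller than D), a minimal PCA sketch via the eigendecomposition of the covariance matrix might look like this. The function name, data, and shapes are my own choices for illustration, not a fixed API.

```python
import numpy as np

def pca(X, n):
    """X: (samples, D) data matrix; returns the (samples, n) projection."""
    Xc = X - X.mean(axis=0)                  # center on the mean
    cov = np.cov(Xc, rowvar=False)           # (D, D) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first
    components = eigvecs[:, order[:n]]       # (D, n) orthogonal axes
    return Xc @ components                   # project onto the n axes

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))               # made-up 50-dimensional data
Z = pca(X, 5)
print(Z.shape)                               # (200, 5): 50 features -> 5
```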
For us AI people, because we got used to it, I just put some rules in there and it works, and magically something happens. No algebra, no matrices, no vectors, no statistics? Well, that's not serious science, is it? So we wanna find, and this is key, orthogonal axes; when we get to the operations, we will see why they have to be orthogonal, because otherwise you cannot do some operations. If this is orthogonal, you have 90 degrees, and some terms become zero when you operate on them, multiply them; and then we come up with some matrices and we diagonalize them, whatever that means. And then we wanna find the covariance matrix, which roughly means you wanna find out what changes with what. So if the color of the car changes, it has no effect on anything else. But when the horsepower changes, things change. So horsepower and the weight of the car covary: they change at the same time. And my genes and my diet, do they change at the same time? Apparently not, because one is entirely in my control; I can change my diet today. So they don't covary. So covariance is a way we look at how things change. Do they change together, or are they independent? Because we wanna do this: we wanna have no correlation, and we wanna have maximum variance. Then you have good data, then you have clean data, then you have compact data, then it becomes easier to train anything. So okay, what does this mean? Well, we have to spend some time to clarify what this means, and we start today; we will definitely not be able to finish it today. But we will continue, because for us, PCA is the beginning of machine learning. Not of AI; of machine learning, note the difference. And there are virtually no serious, big, large-scale projects that you can do without some sort of dimensionality reduction. So this is what we are talking about here: PCA will do dimensionality reduction. 
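The "what changes with what" idea can be sketched numerically. The relationship between horsepower and weight below is invented for illustration: one pair of features is built to covary, the other is independent by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
horsepower = rng.uniform(70, 300, size=2000)
# Weight is constructed to change together with horsepower (plus noise):
weight = 900 + 4.0 * horsepower + rng.normal(0, 50, size=2000)
# An independent feature: drawn with no relation to horsepower at all.
paint_year = rng.uniform(1990, 2020, size=2000)

cov_hp_weight = np.cov(horsepower, weight)[0, 1]
cov_hp_paint = np.cov(horsepower, paint_year)[0, 1]
print(cov_hp_weight)   # clearly positive: they covary
print(cov_hp_paint)    # near zero relative to the scales involved
```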
I give you 4,000 numbers describing the car or the stock market or my health situation, and you give me back just 20 numbers. And you say: look, you don't need 4,000 numbers, you just need 20. Don't waste your time. So, just for the terminology, which will be very difficult to follow on the board: if I write an x, that's a scalar, a number. If I write a bold x, that's a vector. Can I use bold x's all the time? No, I will not be able to do that; this is not Word or LaTeX where I can just bold everything. The mathematical notation is: a regular x is a scalar, a bold x is a vector. So I will just write x, and from the context we have to understand whether it is a vector or a scalar, because otherwise I cannot really comfortably write on the board. And you will know: if I write x sub i, then x is a vector, because I'm going to its elements; so it's clear that it's a vector. If I have a capital X, that's a set; sometimes we call a set a universe of discourse that contains everything. And if I have a capital X which is bold, that is a matrix. So capital bold X is a matrix. And again, I will not be able to mark that all the time; I will not do it, as a matter of fact. So we have to understand from the context: if I write x belongs to X, it's clear the first is a scalar and the second is a set. If I write x sub i belongs to x, the first is a scalar and the second is a vector, even though I'm not marking it. You will see from the context what it is, and I will try to minimize the confusion. But that's a huge source of confusion for many people who learn machine learning and want to get into AI: the terminology. At every moment you have to know: are you operating on a number, on a vector, on a matrix? It is crucial for understanding. So if you lose track, just go back, backtrack and say: what was that? Oh, this was a matrix. 
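The same bookkeeping, scalar versus vector versus matrix, is explicit in code: NumPy tells you at every moment what kind of object you are operating on. The values below are illustrative only.

```python
import numpy as np

x = 3.0                          # scalar: a plain number
v = np.array([1.0, 2.0, 3.0])    # vector: x_1, x_2, ..., x_n
M = np.array([[1.0, 2.0],        # matrix: capital bold X on the board
              [3.0, 4.0]])

print(np.ndim(x))   # 0 -> scalar
print(v.shape)      # (3,) -> n-dimensional vector
print(M.shape)      # (2, 2) -> matrix
```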
If you're reading something that is a good scientific paper, this is meticulously enforced. And I'm not talking about blogs or comments under YouTube videos; I mean scientific papers. You are in good hands if you read scientific papers: the terminology is defined, it is meticulously followed, there's no confusion. There could be confusion for us, because on the board I cannot really do this. So whenever possible, I throw in some explanation and say: this is a vector, this is a matrix. But most of the time, we will understand it from the context. If you don't, just scream, raise your hand and say: what is this, a matrix or a vector? And we will talk about it. Okay. And usually, when you have something like this, if you have x, we use vector notation: x1, x2, up to xn; then you have an n-dimensional input. So we usually write it as a row vector, and some people write it as a column; I can write it that way too. And you see, I'm not bolding it on the board, because otherwise I would have to buy new markers every week. So x1, x2, xn, transposed: you can write it that way too. It's a little bit a matter of convention, but once you define whether it is a column vector or a row vector for you, you keep it consistent and everybody's happy. So, what is a covariance matrix? Hopefully we can bring it to a stage where we can finalize PCA next week. It is all preparation: this week, next week, and maybe a little bit into the third week, we are still doing preparation before we learn any explicit machine learning or AI technique. So what is the covariance matrix? By the way, I will really try to keep the math involved minimal. But there are two, three cases where we cannot do that. One is PCA; there we cannot avoid it. 
We have to go through the suffering and torture of the equations, struggle with it, and then you get nightmares tonight, and tomorrow you stand up and ask: what was that about matrices and covariance? We have to go through that. The other one is support vector machines; we have to go through the torture of the equations there too. And of course, of course, backpropagation: you have to go through every step of how we adjust weights in a neural network. These are the three major ones where you will see me writing a lot of equations on the board. Otherwise I will try to minimize it. Why? Because we are covering so many topics, we don't have enough time to go into details. It's not lack of consideration; it's just imposed upon us. If I wanted to do it for everything, we would have to cut the content of the course in half, so that we had enough time to really describe everything mathematically. So, what is the covariance matrix? We usually denote it with the Greek letter capital sigma, which looks like the summation symbol. And we write cov(x_i, x_j), where x_i is the i-th input and x_j is the j-th input. You give me two inputs, and I tell you their covariance. You give me the weight of the car and the horsepower of the car, and that formula, whatever it is, gives you a number. Its normalized version, the correlation, is bounded between minus one and one, such that we can interpret it. So this covariance, from a statistical perspective, is an expected value. When we write expected value, it means we don't know the true mean; we don't have access to the entire data to calculate the mean. Maybe I had some sample of the population, and the average I calculated I cannot really call the average, because it's the average of the sample. So I call it the expected value. I'm simplifying. So the covariance of x_i and x_j is the expected value of (x_i minus mu_i) times (x_j minus mu_j). 
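The formula just stated, cov(x_i, x_j) = E[(x_i - mu_i)(x_j - mu_j)], can be written out directly on sample data and compared against NumPy's own estimator. The two features below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x_i = rng.normal(size=10000)               # e.g. weight of the car
x_j = 2.0 * x_i + rng.normal(size=10000)   # e.g. horsepower, changing with it

mu_i, mu_j = x_i.mean(), x_j.mean()        # expected values (sample estimates)
cov_ij = ((x_i - mu_i) * (x_j - mu_j)).mean()

# np.cov uses the same definition; bias=True gives the 1/n version,
# matching the plain mean above.
print(cov_ij)
print(np.cov(x_i, x_j, bias=True)[0, 1])   # matches cov_ij
```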
Here mu_i is the expected value of x_i, and mu_j is the expected value of x_j. In other words, simplified again: mu_i is the average along axis i. Mu_i is the average horsepower; mu_j is the average weight of the car. But we don't say average, to be accurate, because do you have the list of all the cars on the planet? No, I don't; I just have 5,000 cars from Southern Ontario. So you cannot say average; it's the expected value. Being accurate is good. Some people say, because we are doing AI, you are allowed to be vague. No, you have to be highly accurate. So this covariance is answering the question: are x_i and x_j changing together? Changing together means, in other words: are they correlated? So we will have a formula that gives us some sense of whether these two numbers change together. Okay, and in some way we want to use it. In what way? In a negative way? In a positive way? What effect does it have on my principal component analysis, whatever that is? Which means: if I select x prime here, then x prime is my principal component, and I get rid of y prime. I want to do that, so I need some guidance to do it. And I know it's very frustrating to start the course with something like this. But again, I want to deal with the data first, and I want to spoil my AI technologies by giving them the best data possible. Don't treat them like a slave; treat them like a prince. Just give them the best data you can. To do that, oh, we're almost done, to do that, just one more thing. We usually work with the generalization of covariance: the covariance matrix Sigma is the expected value of (x minus the expected value of x) times (x minus the expected value of x) transposed. Done. And if you're confused, this is good, because we have something to think about over the weekend, and next week we will finalize this. So I got rid of the indices, which means what? These are vectors. 
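The generalized form, Sigma = E[(x - E[x])(x - E[x])^T] with x now a whole feature vector, can be checked numerically: average the outer products of the centered vectors and compare with NumPy's own covariance matrix. The data is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5000, 3))          # 5,000 samples, 3 features
mu = X.mean(axis=0)                     # E[x], a 3-vector
Xc = X - mu                             # x - E[x] for every sample

# Outer product (x - E[x])(x - E[x])^T per sample, then the expected
# value over all samples:
Sigma = np.einsum('ni,nj->ij', Xc, Xc) / len(Xc)

print(Sigma.shape)                      # (3, 3)
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))  # True
```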
These are not scalars, and I'm using something minus something, times the same thing transposed. What is that supposed to mean? What does that do? We do that in machine learning a lot; TensorFlow is full of it. You cannot do machine learning and AI without operations like that. I just wanted to throw it out there, because on Tuesday, when we restart with PCA, we will go back and start here again and say: okay, we wanna do principal component analysis, we wanna find the most significant components of the data, because we don't wanna give garbage to the AI. Yes, AI is supposed to be smart, but PCA is also smart, and part of AI. So I wanna use one part of AI to clean the data and then give it to some engine that learns something. We will do that next week. So today you will stay around for the first tutorial, and then we will take it from there.