So we want to continue, as preparation, with how we deal with data before we go into the different AI techniques. We said we want to start with principal component analysis as one of the oldest, probably the oldest, AI techniques, and we started saying something about variance and covariance and then ran out of time, so we will continue talking about principal component analysis. So principal component analysis, or PCA: somebody may come and say, you know what, I can rename it easily and call it main feature selection, and even come up with the abbreviation MFS. There is no such thing, of course. If you say main feature selection, MFS, nobody knows it, but that's what PCA does. We call them principal components because the principal components are the features that matter. And that's what we want to do, because the question is: which components, which for us means features, should we keep? In the domain of artificial intelligence, components are features, attributes. So which components are important to keep? This is the question that PCA answers, a question that for some AI techniques, like deep networks, could be useless, because a deep network tells you: just give me the data, don't worry about it, I will figure it out. Which is fantastic if you have a million labeled data points. If I don't have labeled data, I have to do something else. So, which features are important? What is significance? Significance for PCA means variance. Whatever changes is important. Which may be confusing at first sight: so what do you mean, everything that has variance is important? Yes. You're talking about features that describe the instances of a class, and if there is a measurement that stays the same all the time, that measurement is not providing any value for distinguishing between different manifestations of my instances of the class. If I make a measurement and it changes all the time, that is something. The horsepower for a sports car is one thing, for a truck it is something else, and for, I don't know, an SUV it is a different thing again. If the horsepower were the same for all cars, would it make sense as a feature? Of course not. So the fact that horsepower is a good feature for car classification is because it has variance. Things change. So significance is variance, and for us here, intelligence is recognizing significance. How can we recognize that something is important, so we pick it up, or that something is not important, so we eliminate it? So be clear about this if you give me 100 columns of data. This is the starting point for every AI project: a file with a table. That's where all of us start. You get a table of some sort. You have X1, X2, X3, X4, up to Xn, and then you have the first measurement, the second measurement, the third measurement, up to M measurements. So you get this gigantic table. If you don't have this gigantic table, chances are you cannot use AI. If you don't have a table, a CSV file, an Excel file, a file of some sort that says "this is my data", you have a problem. So we usually call these columns the features, and the rows are the observations or measurements. Now, if something is not changing here, no change, if there is no variance across one column, that column is useless. I'm really simplifying it. We will see that it is not that simple, because what we are worried about is not merely variance. We are worried about covariance. We want to see how things change together. A small sketch of that starting table follows below.
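As a concrete illustration (not from the lecture), a minimal Python sketch of that starting point; the file name data.csv is purely hypothetical:

```python
import pandas as pd

# Hypothetical file: the kind of table every AI project starts from.
# Columns are the features X1..Xn, rows are the M observations/measurements.
df = pd.read_csv("data.csv")

print(df.shape)                    # (M observations, n features)
print(df.var(numeric_only=True))   # a feature with ~zero variance carries no information
```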
That's where the intelligence actually lies. So if there is no change, why should I use X3? Whatever X3 is, whether it is the age of the cancer patient, the number of times the CEO has changed in a company on the stock market, or the oil price over the last week, whatever that number is, if it's not changing, it has no value. Well, that's the simple view of it, but things work together. So how can we answer that? That's the reason it's not that simple. And we said, look, how do we calculate variance? I don't want to write the conventional formula for it. What about writing variance as the inner product of a vector with itself, x transpose times x, divided by N? I can do that, right? When can I do this? Can anybody tell me? Where is the average? Yes, zero mean. If you have a distribution with zero mean, if you have subtracted the average, then things become quite convenient, because you move everything into the origin of the coordinate system. So if you subtract the mean, then we can write variance in this nice format: the inner product divided by N, where N is just the cardinality of X, however many elements X has. So if X is (1, 2, 3), and X transpose is then, of course, (1, 2, 3), the variance is one third times one plus four plus nine. Very simple. Why is that important? The simple operation of subtracting the mean from your population suddenly enables you to use such a convenient notation for formulating variance. Why variance? Well, variance is the first thing we need to capture change in the data. But weren't we talking about covariance? Yes, okay, let's go step by step: variance first. And we said that covariance is the expected value of (X minus the expected value of X) times (X minus the expected value of X) transposed. So now, what am I saying there? I wrote that before, and first of all, the expected value was confusing for us; why not just say the mean? You're trying to be formally correct. I may not have the actual mean, so if I go with the mean of the sample, that is actually an estimate; it may or may not be true. And if I have subtracted it, then I can drop it from the formula. So am I saying that this is then just the expected value of X times X transposed? Yes. Which means the expected value is: whatever numbers I have, I work with them, and whatever I get is an expected value, because I have limited observations, whatever measurements we make. And this is the main concern for generalization, because whatever measurements you have, no matter how big the table is, it is always a sample. And it's really easy to break. Why has it become difficult to break some networks? Because if you train them with two million images, your sample is so big that the expected value becomes, or approaches, the true average. Then it's difficult to find cases that break the network. So, okay, let's go back to good old variance. We say this is basically the variance of X. And I'm being a little bit lazy, as I told you; I'm not making a distinction between scalar, vector, maybe even matrix. We should make some effort to understand it from the context, because there is a limit to how well you can distinguish these on the board. And we know that this is the expected value of X times X, minus the expected value of X squared: Var(X) = E[XX] - E[X]^2.
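A minimal sketch of the arithmetic just described, using NumPy; the example vector is the (1, 2, 3) from the board:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Blackboard arithmetic from the lecture: (1/3) * (1 + 4 + 9),
# i.e. the inner-product formula applied to x as written on the board.
print(x @ x / len(x))        # 4.666...

# The formula only equals the variance once the mean has been subtracted:
xc = x - x.mean()            # zero-mean version of x
print(xc @ xc / len(xc))     # 0.666... = population variance of [1, 2, 3]
print(np.var(x))             # same number, for comparison
```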
We know that. Okay, so are we saying variance and covariance are the same? Fundamentally, yes, because you measure change. However, variance is one-dimensional: you measure the change of X3 alone. Covariance is two-dimensional: it measures X2 and X3 together. Do X2 and X3 co-vary? That's the question for covariance. Variance just asks: does X3 change? So they are fundamentally the same, which means if I go to matrix notation, there may be no difference at all. What does that mean? Let's say I have a vector of just three measurements, X1, X2 and X3. If I write that the covariance is X transpose times X, and I'm writing capital X, so this is just a matrix, what do I get? The diagonal is X1 squared, X2 squared and X3 squared; so the diagonal, if I do the matrix operation, is the variance. Everything else is X1 X2, X1 X3, X2 X1, X2 X3, X3 X1, X3 X2. First of all, we see that the upper and lower triangles are completely symmetric, and these off-diagonal parts are measuring the covariance. So we're doing the same type of calculation everywhere: if you measure something against itself, you are measuring the variance, and if you measure the change of two different things together, you are measuring the covariance. So it becomes really nice if I do it in matrix notation, provided I don't have a problem with linear algebra, hopefully not. And for that, one of the few things we have to do is subtract the mean. If I subtract the mean, I shift the center of the data to the origin of the coordinate system, and things become very convenient. What happens if we don't do that? We haven't done many drawings, but can anybody imagine what happens? Forget about the convenience of the notation and the linear algebra. What happens if we do not subtract the mean from the sample? We are looking for principal components, for significant things that change. Yes: your first component may be biased toward the mean of the population, because that's the biggest change. So we have to do it. This is a sort of normalization. If you don't normalize, most of the time you may not be able to trust the first principal component that you find, the first significant feature. We don't want to find the average; we want to find the other stuff. Okay, so if I look at this, that means: if I could somehow diagonalize my measurements and my calculations, hopefully I can use the other tricks from algebra and come up with something practical. That's the rough idea. So if we call this matrix C, for covariance: diagonalize C using a suitable orthogonal transformation matrix A, obtaining N orthogonal special vectors u_i with special parameters lambda_i. Now I'm using spooky language. First we have to diagonalize it. I hope that's enough motivation for why we focus on the diagonal: if I take this and build the dot product, the diagonal is just the multiplication of every element with itself, which, after subtraction of the mean, gives us the variance, the one-dimensional variance. So it would be convenient to do that. If that's motivation enough, we have to find a transformation to do it, and that transformation has to be orthogonal. Why orthogonal? Why orthogonal? Perpendicular.
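A small sketch of that matrix view of covariance, on made-up random data for three features; the diagonal of the product carries the variances and the off-diagonals the covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # 200 measurements of features x1, x2, x3
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]     # make x3 co-vary with x1

Xc = X - X.mean(axis=0)                     # subtract the mean of every feature
C = Xc.T @ Xc / len(Xc)                     # covariance matrix in one matrix product

print(np.diag(C))                           # diagonal: variances of x1, x2, x3
print(C[0, 2], C[2, 0])                     # off-diagonal: covariance of x1 and x3 (symmetric)
print(np.allclose(C, np.cov(Xc.T, bias=True)))   # matches NumPy's own covariance
```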
So, perpendicular in multiple dimensions: orthogonal. Why does the transformation have to be orthogonal? I want to make a transformation. Do you remember what I drew last time? I drew an example that, I don't know, everybody draws the same way because people are lazy to come up with other examples. It doesn't matter; let me draw it in the other direction this time. Let's say this is your data. This is X1, this is X2; these are not X and Y, these are my components, my features. So let's say this is my data. If I subtract the mean, then this would be X1 prime and this would be X2 prime, and now this point is my mean. So PCA is one of those cases, like SVM that we will deal with later, where you sit down and handcraft intelligence. The complete opposite of reinforcement agents, evolutionary algorithms, neural networks of course, decision trees: you don't handcraft those, they have a generic framework, they swallow the data and figure it out. But for a PCA type of method, we sat down and said, okay, what do we need to do? One, two, three, four. Is it a deterministic algorithm? Well, there is a whole set of algorithms; there are many different ways you can do PCA. Today we are just describing the generic form of PCA, how it is generally done. Okay, which means what? I have to take this data, and I need a transformation to do the following: this is X1 prime, X2 prime, and now the data looks like this. Which means I took the cloud and rotated it, maybe 45 degrees to the left, because I don't want to look at it the old way. If I have it like this, then I have a lot of variance from here to here and very little variance from here to here, so I can get rid of X2 prime, because there's not much variance there. So my principal component, if I have two features, is X1 prime. X1 prime, not X1. You can go back, and if you want to say, no, no, I want my X1: if you have deleted it, maybe you can reconstruct it, but you're not worried about that. So what is this transformation and why does it need to be orthogonal? Do you remember we talked about two vectors, A and B: in one case they have some angle with each other, in another case they are parallel, and in another case they are orthogonal. Yes: orthogonal means they are independent. I don't want redundant information. Parallel is completely redundant, some angle is somewhat redundant, orthogonal is independent; they have nothing to do with each other. My principal components cannot be redundant. Why? What's bad about redundancy? Exactly, which means what? We like to compress the data. If you want principal components, that means: you are giving me thousands of columns, I cannot even read them into memory because I have only 128 gigabytes of RAM, and I want to take just 10 of them. Can I take 10 columns out of 5,000 columns and still achieve 98% accuracy? If that is not intelligence, I don't know what is. But simply put: if there is redundancy, it will kill us. It will affect the training, it will affect the classification, you need more resources. If your features are orthogonal, they are independent; I need less memory, and hopefully, with less but independent information, my classifier can converge sooner. A small sketch of that picture, centering the data and rotating it, follows below.
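A rough sketch of the picture on the board, with made-up synthetic data: subtract the mean, rotate the cloud with an orthogonal (linear) transformation, and almost all the variance ends up on the first axis:

```python
import numpy as np

rng = np.random.default_rng(1)
# An elongated 2-D cloud like the one on the board: big spread along one
# diagonal direction, only a little spread across it.
t = rng.normal(size=500)
data = np.column_stack([t + 0.1 * rng.normal(size=500),
                        t - 0.1 * rng.normal(size=500)])

centered = data - data.mean(axis=0)   # subtract the mean: shift the cloud to the origin

theta = np.deg2rad(-45)               # a linear, orthogonal transform: rotate the cloud onto x1'
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rotated = centered @ R.T

print(centered.var(axis=0))           # variance is split between x1 and x2
print(rotated.var(axis=0))            # after rotating, almost all of it sits on x1'; x2' can be dropped
```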
So it can learn the underlying pattern much faster. Okay, so we have to figure out this transformation, and we have to maintain orthogonality. We have to look for orthogonal principal components: find the first one, then the second one, orthogonal to it, then the third one, orthogonal in the third dimension, and so on. Which means what? We are looking for something like C times u_i equals lambda times u_i. My covariance matrix, now we really are talking about the matrix, times that special vector, should be equal to my special parameter times the same special vector. That's the condition under which I can do this transformation, stay orthogonal, have no redundancy, carry no garbage with me; I keep only the things that matter. Okay, give me an example. Let's say you have the matrix (2, 1; 1, 2) times the vector (1, 1). This is, of course, (3, 3): two times one plus one times one is three, one times one plus two times one is three. And I can write this as three times the vector (1, 1). So if this is my C, this is my special vector u_i, this is my special parameter lambda, and this is the same special vector u_i again. So it's possible. Of course, we call this an eigenvector and this an eigenvalue. Okay, back to Algebra 101. And if you think this only works for that one example, no, it doesn't. Let me give another example: (2, 3; 2, 1) times (6, 4) gives you (24, 16), and we can rewrite this as four times (6, 4). So again we have our eigenvector and our eigenvalue. As lazy as instructors are, we usually grab a two-dimensional example for the board. Okay, now you have 10,000 by 10,000; how do you do that? Well, there are crazy iterative methods for that. We are not concerned with how we get the eigenvectors. You can apply this only on square matrices; you cannot do it on N by M, it has to be N by N, which the covariance matrix happens to be. It's always square, always N by N, so I am not worried about that, and I need some good libraries to calculate this for me. And note that not every square matrix necessarily has real eigenvectors and eigenvalues. So, okay, I thought we wanted to do machine intelligence, and now we are doing algebra. Well, maybe we do; I didn't know they were different. So, if you want to do a linear transformation... who said anything about a linear transformation? Well, I was thinking about it and realized that if I want to formulate this as nonlinear, it's going to be difficult as hell. And we have only one and a half hours, so let's do just linear. What, PCA is linear because of the length of our lecture? No, it has nothing to do with the length of our lecture; doing linear things is easy. What does that mean? If I take this and rotate it 45 degrees to the left, that's a linear transformation, right? If I rotate it this way, that way, this is all linear; anything you can do with a rigid sheet is linear. If I grab a piece of paper and suddenly fold it like this, that is not linear. I don't want to deal with nonlinear stuff; I want to keep the transformation easy, so it has to be linear. Is it a convenient assumption we are making? Yes. So let's do a linear transformation, and then hopefully everything works. As engineers, we love linear things.
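The two blackboard examples, checked numerically; the library call at the end is what you would actually use for anything bigger than a 2-by-2:

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])
u = np.array([1.0, 1.0])
print(C @ u, 3 * u)          # [3, 3] both ways: u is an eigenvector with eigenvalue 3

M = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([6.0, 4.0])
print(M @ v, 4 * v)          # [24, 16] both ways: eigenvector with eigenvalue 4

# For anything bigger than a blackboard example, a library does it for us:
vals, vecs = np.linalg.eig(C)
print(vals)                  # eigenvalues of C: 3 and 1
```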
Linear things are simple: you can draw a line and it's done. Which means what? My transformed vector I can write in the form A times (x_i minus m). A line, basically: forget about vector and matrix notation for a moment, that's a line, right? And I can rewrite it as x_i equals m plus A transposed times the transformed vector. Again, a simple line, if you can for one moment forget that this is a matrix. I brought A to the other side, and I'm assuming that the inverse matrix is the same as the transposed matrix: the inverse of A is A transposed. Why can I make that assumption? When can I do that? Somebody said symmetric matrices, but there are symmetric matrices for which this is not valid. We were just repeating the magic word again and again: orthogonal. For orthogonal matrices, the inverse of A is A transposed. Very convenient. I love this orthogonal stuff; it makes life a lot easier. And if you think this is just for PCA: who has experience with TensorFlow? Okay. We need to get some experience with TensorFlow, please. Take a look at the TensorFlow implementation. There is no AI without matrix operations, not going to happen. And interestingly, everything started 120 years ago with PCA, which heavily depends on them. So we can write that there is a modified C, which is A times C times A transpose, such that C prime is that diagonal matrix I was talking about: lambda 1, lambda 2, up to lambda N on the diagonal, and everything else is zero. So there should be a transformation A that I apply to the covariance matrix to diagonalize it. How do we do this? We will not go into that; there are many techniques. There is SVD, singular value decomposition, and many other algorithms. There is not a single PCA; there is a bunch of techniques that can do it. We will run into this again and again: when we get to SVM, we come to the really decisive point where you need to optimize something, and I will say, okay, now we give it to a library to optimize for us, because these are things that have been done and they are not at the center of our attention at the moment. So the diagonalizing transformation we were talking about can be calculated. I'm not lying to you, this is not easy, but it's not my concern, because we are not the only ones doing these things. This has been done for the past hundred years, maybe more; we have capable libraries for it. Give me this, please. Okay, here, you have it, calm down. Then what? In an orthogonal transformation, the trace of a matrix remains the same. The trace of the matrix remains the same. If I'm writing too small, you have to scream from the back of the class; these markers are not thick enough, so I have ordered thick markers, which apparently we don't have in Canada, so I had to order from the USA; hopefully next week I can write thicker and bigger. So the trace of C, which is our covariance matrix after mean subtraction, is the same as the trace of C prime, the transformed one. The blue here is C, the red one is C prime, or they represent them at least. The trace is basically constant and is the sum of your eigenvalues, from one to N. So in an orthogonal transformation, the trace of the matrix remains the same, because it is an inherent characteristic of the data: you keep it, you don't lose it.
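A small sketch of that diagonalization on made-up data: build the covariance matrix, get orthogonal eigenvectors from a library, form C' = A C A^T, and check that the inverse equals the transpose and that the trace is preserved:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # made-up correlated features
Xc = X - X.mean(axis=0)                                   # mean subtraction
C = Xc.T @ Xc / len(Xc)                                   # covariance matrix

vals, vecs = np.linalg.eigh(C)        # eigh: for a symmetric C the eigenvectors are orthogonal
A = vecs.T                            # rows of A are the eigenvectors u_i

C_prime = A @ C @ A.T                 # the transformed covariance
print(np.round(C_prime, 6))           # lambdas on the diagonal, (numerically) zero elsewhere

print(np.allclose(A.T, np.linalg.inv(A)))            # orthogonal: inverse equals transpose
print(np.trace(C), np.trace(C_prime), vals.sum())    # trace preserved = sum of eigenvalues
```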
Talk about intelligence. Interestingly, you can also write the trace as the sum of the variances; of course, if you do the dot product of the measurement matrix, it is all variance. We call it variance when it is the same data, one dimension against itself, and the variance of something with something else is called covariance, but I don't think Mother Nature distinguishes between variance and covariance; we do that just to keep our equations understandable. So when you do this, this one is the most important and this one the least important. Can you tell me how that happens? If I do the transformation and find the first principal component, the second, the third, the fourth, the nth, how come the first eigenvalue in the transformed matrix happens to be the eigenvalue of the most significant eigenvector, which is the first principal component? They are associated with u1, u2, u3, up to un: first principal component, second, third, nth. But how does it come to be sorted? Who sorted it? I did; I'm skipping many, many steps just to squeeze PCA into one lecture. How did that happen? Why does the eigenvector corresponding to the largest eigenvalue land in the first spot, and why does it carry the largest variance? Because we started that way. When you first start with the blue dots, before there is any red, and you go for the biggest variance, the first component gets the biggest variance, then the second biggest, then the third biggest. Naturally, you get them sorted. So for the principal components you have lambda 1 and lambda 2 and lambda 3, down to lambda n minus two, lambda n minus one and lambda n: from important to useless. But you are not done yet. You just have a matrix, and you can break it down and write every vector as a linear combination of the eigenvectors. Writing something as a linear combination is not a big deal: alpha 1 times vector one plus alpha 2 times vector two plus alpha 3 times vector three, and you can do that for the first row of measurements, then for the second, the third, and so on. That's not the big deal. What is the big deal? PCA is not done yet: now you have to determine a cut-off point. You need to pick an N prime much smaller than N. If the full number of features is capital N, you pick an N prime that is much smaller than your actual N. You had 5,000 features; how many principal components would make you happy? Okay, 5,000? Let's go with 2,000? Come on, be more aggressive. Okay, 500? No, maybe 10 principal components out of the 5,000. That sounds outrageous, but PCA can do that for many, many data sets. And you can actually calculate the residual error to find N prime. How many of them can I drop? If I go back here and just ignore this smallest lambda, does my data change? Not much. This one? Not much. This one? Not much. You come up to, say, lambda 12 and suddenly the reconstruction error starts to become noticeable. So, oh, stop. Okay, take 13. How to pick the cut-off is a whole different story. For small applications, where you usually have fewer than 100 features, people go with two or three principal components; I don't know why people do that, and nobody asks whether it is really empirical knowledge.
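A sketch of picking N prime from the sorted eigenvalues; the eigenvalue numbers here are made up purely for illustration:

```python
import numpy as np

# Made-up eigenvalues, standing in for the output of np.linalg.eigh on the
# centered covariance matrix.
eigvals = np.array([9.1, 4.3, 2.2, 0.9, 0.3, 0.1, 0.05, 0.03, 0.01, 0.01])

lam = np.sort(eigvals)[::-1]                 # sort descending: most important first
explained = np.cumsum(lam) / lam.sum()       # fraction of total variance kept so far
residual = 1.0 - explained                   # reconstruction error if we stop here

n_prime = int(np.searchsorted(explained, 0.98) + 1)   # smallest N' reaching 98%
print(np.round(explained, 3))
print(np.round(residual, 3))
print("keep", n_prime, "principal components out of", len(lam))
```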
So: give me the two principal components, or the three principal components, which means that in that gigantic table you effectively keep just a couple of features, which could turn out to be X5 and X235; nobody knows in advance, it can be anything. Okay, so what are we doing here? What we are doing is what we generally call dimensionality reduction. And please, please, please, if you keep one thing from this course, it is this: as a machine learning person, an AI person, don't be arrogant. Don't say, my network has 300 layers, it will figure it out. No, make the life of the network easier. If you go to industry, you will see that, in contrast to the mentality in academic institutions, people tend to save money and not waste it. If you go in with 5,000 features, the training will take two weeks; you want to do it in two days. So dimensionality reduction is extremely important before I even think about what type of technique I want to use. Because this is the problem: in that table of X1, X2, up to Xn, you cannot manually or visually recognize what is important and what is not. You need some statistical analysis to figure that out. Okay, if you have just purchased the NVIDIA super GPU server for half a million dollars and you want to play around with it, be my guest; forget about PCA. But the rest of us, mortal people with limited money in our pockets, we do PCA. And trust me, I have seen people fired for not using dimensionality reduction, and not just PCA: we have LDA, we have t-SNE, we have other things. Ignoring it means you don't care about resources, and that doesn't fly in practice; whether you are doing conventional math or AI or whatever, we don't have unlimited resources. And this is one of those things I love: I like to hire the type of engineer or computer scientist who also has a sense for resource management. Someone who says, okay, let's analyze this, compactify the data, and then think about the design; that's the engineer I want to hire, yes. The applications of PCA, there are so many; I will give you some, I have some here. The biggest one that I'm involved in is computer vision: you get thousands and thousands of features and you have to select some of them to be able to do something meaningful. PCA is a linear transformation. That is a disadvantage, but it is also why it can look so simple: although I skipped many steps and did not go into the details, because then we would need three lectures just for singular value decomposition, it is still simple enough that we could understand the basic ideas in less than an hour. It is linear, but the world is not linear. So what happens if the linear, conventional PCA doesn't cut it? Well, there are many options that hopefully we will talk about in the tutorial; some of them use kernelized PCA: we bring the data into a space where the structure becomes linear and then apply PCA. There are so many tricks; SVM does the same kind of thing, we will come to it. And PCA is unsupervised. I love unsupervised techniques: they don't need a babysitter. "Please tell me, what is this? How should I do this?" I don't know, figure it out. "I'm sorry, I'm a neural network, I need labeled data." This one is unsupervised. You will learn with time to have a deep appreciation for unsupervised learning: you just throw the data at it, go. No teacher, no instruction, no desired output; it figures out what the output should be. Very, very important.
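A minimal sketch of that reduction with scikit-learn on synthetic data; the shapes (1000 observations, 500 features, 10 kept components) are made-up stand-ins for the gigantic table:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic stand-in: 1000 observations, 500 features, but only a handful of
# underlying directions actually vary.
latent = rng.normal(size=(1000, 10))
X = latent @ rng.normal(size=(10, 500)) + 0.01 * rng.normal(size=(1000, 500))

pca = PCA(n_components=10)            # keep 10 principal components
Z = pca.fit_transform(X)              # PCA centers (mean-subtracts) the data internally

print(X.shape, "->", Z.shape)
print(pca.explained_variance_ratio_.sum())   # how much of the variance those 10 keep
```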
Of course, it uses statistics and calculus; it is a dimensionality reduction algorithm, and it is also a visualization algorithm. When I say that most people use PCA to find two or three principal components, that has a practical reason: I cannot visualize a thousand features, but I can visualize two or three. If I have a thousand features and 997 of them are, statistically speaking, almost garbage, I can take the three important ones, visualize them, rotate them, look at them. That has immense value. At the moment, one of the things preventing the widespread applicability of AI in many fields is visualization, or the lack thereof: how can you visualize what has been learned? Very important. PCA is always one of the first tools we look at, and it is, of course, intelligent, because it recognizes significance. So, what type of data do we have? I just want to connect this. We learned the first thing in machine learning: take the data, clean it up by getting rid of the junk that doesn't matter, and then start working with it. So we are basically talking about AI and data. What data types are we using? We use plain numbers. We sometimes use symbols. We use text a lot. We use images, videos, audio files, and maybe some other representations. Each one of them has its own specific requirements; the way you treat text is very different from images. The biggest success stories of the past six, seven years have been with images, and that has contributed to the fact that everybody is using it, because everybody understands images. If I talk about a symbolic automaton, my grandma doesn't understand it, but she can look at the iPad and say, yeah, okay, we just search, and oh, we can find it. Text is different, symbols are different. Videos, oh, we have not even touched videos; videos are tough. There is no such thing as general video recognition at the moment, because we don't have the bandwidth to attack videos. A video is a gigantic collection of images, and at the moment we are dealing with one image at a time, and that image is tiny, 240 by 240. The smallest images I work with in the lab are 50,000 by 50,000 pixels: medical images, satellite images, astrophysical images, oceanography, many, many cases. So you have to know what your data is. It's very important; it is part of the analysis when you pick a project. What is the data? Do you have data? If you don't have data, I don't know what you're doing here; you need data. Either you have an equation, and then you do interpolation, differentiation, analytic function approximation, or you don't have an equation, you have data, and then we do function approximation with our networks, regression, and things like that. Then you have to do some preprocessing: some sort of filtering perhaps, in some cases normalization, outlier detection, the dimensionality reduction we just talked about, and augmentation; maybe some other things, but these are the most important ones. Of course, filtering an image is very different from filtering text. If I'm filtering an image, I want to get rid of noise: speckle noise, Gaussian noise, salt-and-pepper noise.
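A sketch of using the first two principal components purely for visualization; the Iris data set here is just a small stand-in for a wide table:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # small stand-in data set, 4 features
Z = PCA(n_components=2).fit_transform(X)     # project onto the 2 principal components

plt.scatter(Z[:, 0], Z[:, 1], c=y)           # the whole table, drawn in 2-D
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```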
If I want to filter text, I want to get rid of "a" and "the", the words that are everywhere and don't carry any meaning; I want to keep the important words. Normalization: sometimes you have to dig in. You read a paper and say, oh my God, they have 99% accuracy, and then you implement it with the same library and it doesn't work. You send an email and they don't respond, you wait, and maybe at some point somebody responds and says, yeah, we didn't mention in the paper that between the second and third layer we normalize the data. Ah, okay. Many inputs are normalized. And look at this interesting connection to the mean subtraction: look at any face or object recognition pipeline, there is an average image that you have to subtract from your data. Isn't that ironic? PCA introduced that 120 years ago. Apparently subtracting this average image, which is a normalization, is very important: it puts us at the origin, so we have a point of reference to compare to. So you have to figure out what type of normalization you need. I claim there is no AI project in which you do not need some sort of normalization; you have to normalize the data. Outlier detection: suppose you have hyperdimensional data, I grab the two principal components out of 5,000 and look at them, and I have one point way out here. That's an outlier. Is it? If this is the Kitchener-Waterloo population and you are looking at income, the size of the house, and how many cars people have, that point is not an outlier; it could be Mike Lazaridis. So what does "outlier" mean? Does it mean it is noise that I can filter out? Or is it valid data? In cancer diagnosis, for example, it could be a rare case, and you are a small hospital that has never seen that rare case; if you filter it out, you lose information. How do I know? Domain knowledge. You have to specialize. Also as an AI expert you have to specialize: AI for satellite imaging, AI for robotics, AI for this and that. You need the domain knowledge, of course you do. So how do we represent data? We represent data in two ways. First, handcrafted features. Until 2012 or '13 there were only handcrafted features. You would sit down and say, okay, what type of measurements do I need to do stock market prognosis, cancer diagnosis, robotic navigation in an unknown environment? What features do I need? That's handcrafted features: the attributes that are necessary for classification, segmentation, prediction, estimation, function approximation are designed by somebody. For example, usually we compute some statistics: average, standard deviation, variance, skewness, things like that. If you go specifically into images, there are things like SIFT, methods where somebody sat down and came up with the idea: if you show me an image, I will calculate this and this and this. It's handcrafted. The computer doesn't figure it out on its own; we have told the computer, one, two, three, four, five, do this. And then, since 2012 perhaps: automatic feature extraction, which mainly means deep features. This is new, new for everybody, and deep features are a curse and a blessing at the same time. It's a blessing because it has contributed to the massive popularization of AI.
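A small sketch of those two normalizations, mean-image subtraction and per-feature standardization, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Mean-image subtraction, as used in face/object recognition pipelines.
images = rng.integers(0, 256, size=(100, 240, 240)).astype(np.float64)  # toy image stack
mean_image = images.mean(axis=0)       # the "average image" the lecture mentions
centered = images - mean_image         # data now sits around the origin

# Per-feature standardization, the usual normalization for tabular data.
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 20))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0).round(3))    # ~0 for every feature
print(X_norm.std(axis=0).round(3))     # ~1 for every feature
```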
Now everybody can just download one Python library and a pre-trained network, and there you go, you have the best features on the planet. This is exactly the problem, because then you think you can do things without domain knowledge. Well, you can't. Many of us are worried about a third AI winter: that this type of people promise too much and then cannot deliver in their company, in their institution. Please, if you leave this class, learn one thing: don't say you are an AI expert. I don't know anybody who is an AI expert. If somebody tells me he's an AI expert, I will ask: in what subfield of AI are you an expert? And if you say neural networks, okay, what type of neural network? You cannot possibly be an expert in all of AI. Don't use that title; it's dangerous, and it's becoming worn out. On the other side, people are too fast in dismissing handcrafted features. As an engineer, I don't like that, because there are many fields in which handcrafted features still have a lot of value. So please, if you are in a team, have some backbone, and if you're sitting in a meeting and everybody's showing off the deep features, ask: can we also look at the conventional features? And everybody looks at it: yes, let's take a look at the conventional features. Why? Maybe we don't have enough labeled data, so we cannot train anything. What about a pre-trained network? We have regulatory restrictions; we cannot use a pre-trained network. That is a professional talking. So what comes next? Encoding. Let me go here. So: what type of data do you have, how do you preprocess the data, and what type of features and attributes do you extract? With handcrafted features the chain looks like: features, then artificial neural network. With deep learning it looks like: artificial neural network, everything in one. It's very enticing, everything in one hand; but if that one hand collapses, everything is gone, and I don't want to put all my eggs in one basket. Encoding is the next thing to look at for your data. Encoding basically means compression and embedding, and you either do it without learning or with learning. I know that doing things with learning has become very attractive, and this is fantastic. Has anybody gone back and looked at the people whose names are in the news right now, people like Geoff Hinton, Yoshua Bengio, LeCun and others, Richard Sutton and the rest? Go back 15 years in their careers: nobody was paying attention to them, they were not getting money, nobody gave them a Canada Research Chair, they were working empty-handed. So for those people, and for all of us, we are happy that learning has become so widespread. That's fantastic. But it's not a reason to just ignore everything else we have. In the without-learning category we have PCA. PCA is all over the place: it's in filtering, it's in encoding, it's in compression, it's in visualization. I have seen people classify data with PCA; they love PCA so much they say, from A to Z I do everything with PCA. That's the other side of the coin, exaggerating in that direction. Fisher Vector: I bet not many people have heard of that. LDA: hopefully we talk a little bit about LDA. And VLAD. These are techniques that help you compress or embed. Embed? What does that mean? You give me 10,000 features, I pick and choose, do some calculation, and make 200 of them. It's like PCA, but we call it embedding; it's a different type of algorithm. We will talk about at least one of them. With learning, of course, autoencoders, one of the most fascinating types of neural networks.
Deep artificial neural networks, autoencoders. You have your layer, and layer, and layer, and then you come back, and come back, and come back. You give X in, and you get X out. What? Why would anybody in their right mind do that? You input X and out comes X: what is that good for? Well, if you do that, and we will talk about autoencoders properly later, what sits here, in the deepest layer, is a compressed X. The autoencoder is the neural network version of PCA, if you like. But it comes with a lot of headaches; I have spent many, many hours working with it. I love PCA because PCA is deterministic: just go one, two, three, four, and it gives you something. An autoencoder can be a pain in the neck: how many layers, how should I set it up, and sometimes it won't converge. And it is limited: I cannot put one million inputs here. Theoretically you could, but do you have 5,000 Tesla GPUs? I don't think so. Practically, we can't. And of course we have t-SNE. I want to talk about t-SNE; if I can make it, we will talk about it in one of the tutorials. Also fascinating: you can visualize hyperdimensional data, put it in two dimensions and show it, and suddenly you see the complex relationships because you can actually look at them. So it's visualization, it's embedding, it's compression; sometimes these worlds just flow into each other. Know the terminology; this is the first rule. You should not be caught off guard if someone says "I'm embedding" and you have to ask, embedding, what is that? And if you don't know, ask. I'm just saying we should know the important terminology for all the techniques we have. So, what are the applications of PCA? Of course, data reduction. Of course, data visualization. Data classification: strange, but yes, people do that, directly or indirectly. I would say not directly; don't do it directly. PCA is not a classifier, you cannot abuse it for classification, and I wouldn't know how you could do it with just conventional PCA; you need some other component for that. But you can use it as a precursor to classification. Factor analysis: that was one of the first applications of PCA, back when there was no AI and nobody thought PCA would become a component of AI. Trend analysis: trust me, classification, which is 99% of what is being done right now, is only 1% of what is needed in practice. Classification is exciting and enticing because you have labeled data, you report one number and say, my accuracy is 99.5%. Something like trend analysis is much more usable, much more feasible, and much more needed. So let's not make any part of AI an orphan; we keep everything, we want a repository of many, many techniques. And of course, noise removal: as engineers, we understand that one very well. I want to get rid of noise, so that's important.
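Going back to the autoencoder described a moment ago, a minimal sketch assuming TensorFlow/Keras; the layer sizes, bottleneck width and toy data are all made up:

```python
import numpy as np
import tensorflow as tf

X = np.random.default_rng(5).normal(size=(1000, 50)).astype("float32")  # toy data

# Encoder squeezes 50 inputs down to an 8-dimensional bottleneck, the decoder
# blows it back up; the training target is the input itself (X in, X out).
inputs = tf.keras.Input(shape=(50,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)
code = tf.keras.layers.Dense(8, activation="relu", name="bottleneck")(h)
h = tf.keras.layers.Dense(32, activation="relu")(code)
outputs = tf.keras.layers.Dense(50)(h)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)   # X in, X out

encoder = tf.keras.Model(inputs, code)     # the compressed representation, PCA's nonlinear cousin
print(encoder.predict(X[:3]).shape)        # (3, 8)
```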
So, what is a meaningful chain in AI? Data comes in; somebody has to give us features. Then we usually encode these features and get a compact representation: you give me thousands upon thousands of features and I embed them, compress them, force them into a more compact appearance. That's encoding. But it may still be too big: you give me 5,000, I encode them into 2,000, it's still too big. So then I compress further, do dimensionality reduction, get something small, and then I give it to a classifier. For example, here I could use a Fisher vector for the encoding; here I could use PCA, LDA, t-SNE for the reduction; here I could use artificial neural networks, SVM, K-means, any classifier. This is still the pre-deep-network picture, because now the deep network says, you know, don't worry about any of this; the deep network can do the whole chain. Then there is no Fisher vector, there is no PCA anymore; everything is done by one deep network. Of course, that is very tempting: I don't need to understand what VLAD is, what t-SNE is, just put the data into a network, it extracts the features, and there is your classifier. Why not? That would be great. You will see that this is not always desirable. Okay. How much time do we have? Four minutes. Okay. So, if you want to select your project: one, you find a problem. Two, you analyze the problem, which basically means: what is the input, what is the output, and do you have any knowledge about the domain? For me, analyzing the problem, input and output, also includes filtering, normalization, dimensionality reduction; that is analyzing or preparing it. Three, you select an approach: what type of architecture, what type of parameters, and so on. Four, you design the approach, whatever design means for that specific algorithm. Five, you train it. Six, you retrain if necessary. Seven, you go into the recall phase and start using it. And eight, you compare against other methods. That's a typical AI project for this course. So: find the problem. Is it really a problem? Don't invent problems, please; find a problem, that's very different. Analyze the problem: what is the input, is there data, what type of data, is it numbers, is it video, is it images, do you have enough? Again, to me the perfect project is one related to your fourth-year design project; that's fantastic. If not, find something with a publicly available data set; Kaggle is one, there are others. Then you select the approach: okay, this is the data, this is the input, I have labeled data or I don't have labeled data, so I want to do this or that. That's where we may need to talk. Then you design the approach, which means: if it is a network, how many layers; if it is SVM, is it binary or multi-class, what parameters, and so on. Then you start training it. If it is text, every training run may take 10 seconds; if it is images, it may take 10 days, so you have to be cautious and aware of that. We may have to retrain, and it is almost always necessary to retrain, because the first result is junk, and then you retrain, change the architecture, do this, do that. "I ran it, and the best accuracy I get is 23%. What's going on?" And with SVM: who said SVM guarantees the optimal result? What were the parameters? Have you set the parameters? As part of the design, you have to be aware of the parameters. And then we start applying it on unseen data, and then we know how well it really does. During training it may give you 99%; if you recall it and it gives you 52%, that means you don't have a solution. The sketch below puts this chain, normalization, reduction and a classifier, into a few lines of code.
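A minimal sketch of that chain with scikit-learn: normalize, reduce with PCA, classify with an SVM, then compare training accuracy with accuracy on unseen data; the digits data set and the choice of 10 components are just stand-ins:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)                 # 64 pixel features per sample
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# normalize -> reduce dimensionality -> classify, the chain from the lecture
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())
model.fit(X_tr, y_tr)                               # training phase

print("training accuracy:", model.score(X_tr, y_tr))
print("recall (unseen data) accuracy:", model.score(X_te, y_te))
```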
So, the difference between these two numbers should not be more than two, three percent. And never, ever should the recall accuracy be higher than the training accuracy; if you see that, something is fundamentally wrong. Okay, we will continue next time with two other visualization techniques before we start with the actual material.