Hey, let's get started. So the videos you're seeing on the screen right now are recordings of microbes swimming around in a volume of water. I forget what the exact sample chamber size is off the top of my head, but it's basically bacteria of different species and such swimming around. The microscope being used is called a digital holographic microscope, which is interesting in that it captures three-dimensional movement data, three-dimensional data, in a two-dimensional frame. It encodes the depth information in Fourier space, which is really cool. It has to do with a kind of LIGO-style interference between two laser beams that encodes where each particle is vertically, and therefore we can retrieve depth information in addition to the lateral information.

So consider a potential mission to Europa or Enceladus, which are ocean worlds: moons of Jupiter and Saturn that are covered in water. As far as we know, life depends on the existence of water. In designing such a mission, it's of interest to include an instrument like this to see if there's anything swimming around in there, because motility is one of the strongest biosignatures we have. Motility is the self-propelled motion of some life to find food, to compete, to explore, et cetera. So there's a project at JPL to develop such an instrument and to develop the processing algorithms for it, because an instrument like this records gigabytes to terabytes of video within a couple of observations. And bandwidth is constrained by distance: when we're as far out as Saturn and Jupiter, there isn't much bandwidth and we can't downlink much data. So it's infeasible to downlink raw video from these potential missions. All we can do is process onboard to find and track these particles, and send down just the tracks and a few snippets of the video to show the scientists that we found something up there that's moving.

So this is your project one. Project one is called Motility Biosignature Classification. What we've done for you is we've taken these videos and we've tracked the particles already, so now we have the x, y, and time locations for each particle. Your job is to develop a classifier that determines whether each particle track exhibits motility: whether it looks like it's moving around looking for food, reacting to stimuli, et cetera. At face value this seems fairly easy, because as you saw in that video, our human brains are very, very good at identifying whether something looks alive or not; that's been important for our own survival, and through millennia of evolution our brains have gotten pretty good at it. But when you try to quantify it, when you try to get a computer to do it, it actually gets quite challenging.

So let's see here. This is your mini project. We have the Piazza post, and we have the actual PDF. The PDF provides additional information on where the data comes from, details about the competition, the Kaggle invite link for joining the competition, and a lot of resources to get you started. The first thing we did is we took those tracks, those x, y, t coordinates, and extracted some basic features for you already.
So if you want, you could just go into the data and pull up the features CSV. It has a unique ID for each track, it has the label as a binary (zero is non-motile, one is motile), and then it has some features already implemented for you, which I've described in the PDF. There's also a Jupyter notebook that I uploaded to show you how I generated those features, so you can go into the code and see how I calculated the average speed, the standard deviation features, et cetera. You're very much encouraged to develop your own features. In fact, I've intentionally omitted several very useful features, because the project was too easy with those columns included. And because this project allows teams of up to four members, I think it's probably a good idea to assign some students to developing features and some students to developing the methods or the models, or whatever split you see fit. There's enough work here for a whole group of four to genuinely contribute and try out some interesting things.

So you can download this Jupyter notebook; it's sample code, and you can modify it. I actually made it so that if you implement your own feature functions, it'll just generate the features for you, so you don't have to mess with CSVs and all that. It doesn't use pandas at all, because I don't use pandas. So feel free to use it. Also, Kaggle provides a Colab-like environment (in fact, it's owned by Google, so I think it's the same back end), so you can run notebooks on Kaggle, generate your final predictions there, and submit them. You don't have to go back and forth to Colab if you don't want to, and you can get GPUs there as well.

So that's the features. We provide the features, and in addition we provide the tracks themselves. So we have the basic features, and then we have the original tracks: if we go here, train.json. We also provide these as CSVs if you prefer to work that way, with a different CSV per track. Per track, we have the t, x, y coordinates: time, and then the x and y coordinates of the track itself. So if you want to try an RNN (I haven't tried an RNN on this, and I don't actually know if it works), you could theoretically run an RNN directly on the coordinates instead of extracting any features. Or you can use the provided notebook to extract your own features from these coordinates.

All right, so that's the project. It's really fun. We're writing a paper on this right now, so if you do incredibly well on it, I'll be a little scared. That's unlikely, though, and not because of anything on your end: the project we're working on has to run these models on the spacecraft itself, so we're limited to decision trees, SVMs, fairly simple models like that. The computer we're targeting is a RAD750, which is like the CPU from an old Apple machine, the OG Steve Jobs stuff. So we can't use the fancier models, but you can, and you can see how well they work.
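As a quick illustration of what "extracting features from a track" means, here's a minimal sketch in numpy. The function names and exact feature definitions here are mine, not necessarily the ones in the provided notebook, so treat this as the idea rather than the reference implementation:

```python
import numpy as np

def average_speed(t, x, y):
    """Mean step speed of a track: distance between consecutive
    points divided by the time elapsed between them."""
    dt = np.diff(t)
    step = np.hypot(np.diff(x), np.diff(y))
    return np.mean(step / dt)

def speed_std(t, x, y):
    """Standard deviation of the per-step speeds along a track."""
    dt = np.diff(t)
    step = np.hypot(np.diff(x), np.diff(y))
    return np.std(step / dt)

# Hypothetical usage on one track's (t, x, y) coordinates:
t = np.array([0.0, 1.0, 2.0, 3.0])
x = np.array([0.0, 0.5, 1.5, 1.6])
y = np.array([0.0, 0.2, 0.1, 0.9])
print(average_speed(t, x, y), speed_std(t, x, y))
```

The same pattern, compute a per-step quantity and then summarize it, gives you plenty of candidate features: step angles, turning rates, net displacement over path length, and so on.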
Oh, one more thing. We've also included in the training set both data actually collected from microscopy videos, from the actual microscope, as well as simulated data from a simulator that we wrote to generate reasonable tracks mimicking the behavior of motile particles, using some autoregressive-style methods. But the test set, the one you'll be competing on (there is a competition, there's a leaderboard; if you haven't done a Kaggle before, there's a ranking, you submit your results and it ranks you; first place gets t-shirts or something, I don't know, and there's extra credit for the top 50% or so), that test set only contains real microscopy data. So your training set combines real data and simulated data, and your test set only includes lab data. This is a very common setup in scientific applications, where it's expensive to collect real data: you have to hire a microbiologist to sit there with a microscope and feed in samples, you have to develop the instrument, et cetera. So instead we write a simulator to generate simulated data, because that's easier for us. You have to decide whether you want to use the simulated data at all. You'll know which is which because the track UID starts with "sim" instead of "lab". Let me keep scrolling here; yeah, there we go. Anything starting with "sim" is simulated, "fake" data, quote unquote, and anything starting with "lab" is real data. So you might want to weight the lab data more, or you might want to not use the simulated data at all. I don't know; I actually don't know the answer to this, so that's part of your research.

And then finally, the evaluation. We've been using loss and accuracy so far for evaluating how well our models do on a dataset. For this problem, we're going to be using something called the F2 score. The F2 score combines precision and recall. I don't think we've covered recall and precision yet, and I really should have, but essentially: there are positive and negative examples, ones and zeros. Recall is how many of the positive examples we found. Recall is high when you have a lot of true positives and not a lot of false negatives; high recall means you found all the ones you were supposed to find. Precision is highest when you don't have a lot of false positives. So I could predict one on everything and have perfect recall but awful precision, because I'd have no false negatives but a lot of false positives. Does that make sense? If I have time at the end, I'll write out the equations for precision and recall. Anyway, precision and recall are really important when we don't have a lot of positive examples. If we have five ones and a hundred zeros and I predict zero on everything, we'll have about 95% accuracy, but that doesn't mean anything, because we didn't actually find the ones we were looking for.

The F2 score emphasizes recall. Imagine we're on a mission out at one of these ocean worlds: if there's a motile biosignature, we really don't want to miss it, so we can't tolerate many false negatives. But if we spuriously flag some non-motile tracks as motile, it's not a big deal; we just lose a little bit of bandwidth, but no harm done, we don't lose any science. Whereas if a track is motile and the algorithm says it's not, that's a bigger penalty than the other way around. That's the intent, and that's why F2 is the evaluation metric.
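If you want to sanity-check your F2 score locally, scikit-learn has these metrics built in. This is just a toy sketch with made-up labels, not the competition's official scoring script:

```python
from sklearn.metrics import precision_score, recall_score, fbeta_score

# Toy example: 1 = motile, 0 = non-motile (made-up labels, not real data)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)      # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)         # TP / (TP + FN)
f2        = fbeta_score(y_true, y_pred, beta=2)  # weighted harmonic mean, leaning toward recall
print(f"precision={precision:.3f} recall={recall:.3f} F2={f2:.3f}")
```

With beta = 2, recall counts roughly four times as much as precision, which is exactly the "missing a motile track hurts more than a false alarm" behavior described here.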
And remember, I think we included that as an extra credit question somewhere, so I guess you get the extra credit if you're watching the lecture. Anyway, that's the project. I won't go on too much about it, because the rest is on the Kaggle page and in the PDF. Remember how the points line up: 10 points are for submitting something reasonable on Kaggle (we have a TA benchmark; you're not required to beat it, 10 points is just for Kaggle participation), and 90 points are for your report. You're going to write up a research report explaining all the things you tried and all the decisions you made. That's the really important part, and we're going to be reading your Colab notebooks. We're going to be looking for plots, for the way you visualized your data, the way you designed your features, et cetera. So don't rush your report. The Kaggle deadline is Monday and the report is due the day after, and I know everyone's going to work like crazy on the code and not start the report until the day of the deadline. Please don't do that. Write the report as you go, keep a research notebook, pretend this is research practice a little bit. Keep notes, and at the end you'll save yourself a lot of work if you have half the report done even by the Kaggle deadline, okay? So the Kaggle is due next week at 5 p.m., and the report is due at 9 p.m. the day after, so that gives you a little bit of a buffer. Okay, everything else is on the website and in the PDF. Any questions on the project?

Okay, cool. I'm pretty excited. You know, before this it was a loan dataset, and I didn't really like it; I didn't like the finance datasets. It was weird, predicting that this person will default on their loans and ruin their personal finances. But this is science, and this is a project I've been working on for about two years now, so it's pretty exciting to be able to share my work as well. Let me know if you have any questions. Is there also going to be a problem set this week? No, just this project for one week; we'll go back to problem sets after that. We had a little bit of overlap in the old schedule, but we managed to stretch it out. Cool, all right, let's go over to the lecture. Do I have anything else I want to share here? Oh, one more thing; I did a lot of work on this. I also have a track visualization example so you can actually see what some of these tracks look like. Again, motility looks a little squiggly, non-motility looks like a straight line. You would think that would be easy to classify, right? Okay. The reason I spent so much time on this is that today's lecture is a little short, so we have some extra time. There we go. Okay, great. So today's lecture is on clustering and dimensionality reduction.
I have this slide that I didn't use at all, but it's a quick summary of everything I've talked about so far. Okay, so in the past few weeks, with Dr. Rebecca's half of the course as well as the first couple of deep learning lectures that I gave, we've been talking about supervised learning. Just as a quick reminder, supervised learning means you have data and you have labels, and you're trying to go from the data to the labels; or more generally, you have some kind of external annotations or a task that you're trying to perform. That broadly falls under the category of supervised learning. Within this topic we've talked about linear models, overfitting, and loss functions; we've talked about nonlinear models, learning rates, and optimization; and we'll do a little bit more about model selection, which we've mentioned briefly. Supervised learning is what Professor Yue referred to as the workhorse of machine learning: in most practical settings where you're trying to train and deploy something to automate or improve some process, it's probably going to be supervised learning. But supervised learning is expensive, and that's because of those annotations, those labels. It's hard enough to get data that's clean and structured; data collection in and of itself is such a hard problem. But then you also need some human to sit there and label the data. I'm sure you've heard of all these different startups, San Francisco Bay Area companies, Amazon Mechanical Turk; surely you've seen ads for them as CS majors and students interested in AI. I get these ads all the time on Facebook or Instagram or wherever I go; they just follow me around. They offer to label your data for cheap, and that's because it's a hard problem.

So the other side of machine learning that we'll now talk about is called unsupervised learning. Unsupervised learning is trying to do machine learning without labels, and we do this by extracting hidden structure. When I say hidden, I don't mean that anyone is deliberately obscuring it; it's just that the data is so complex that it's not easy for us to draw out the structure just by looking at it. We need additional algorithms and methods to pull out the structure, the hierarchy, the patterns in the data, and from that we build some sort of framework in which we're able to do the machine learning task or analyze the data. So in a sense, without manually labeling the data, without providing annotations, without providing the truth, we're instead inferring it through the structures and patterns embedded in the data itself, okay? That's the intent of unsupervised learning. Note that this is different from self-supervised learning, the deep learning flavor of unsupervised learning, which tends to take these ideas in different directions, so don't get confused by that. Deep learning likes to eat things up and reinterpret them for its own purposes; we're just talking about straight unsupervised learning. So today we're going to talk about clustering and dimensionality reduction.
You may be familiar with some of these topics already, especially if you've taken the linear algebra course; you should know about matrix factorization, PCA, and SVD. But we're going to look at them from the perspective of machine learning and see what machine learning really uses them for. Clustering is one of those things where conceptually it's pretty easy to understand, but when we talk about it algorithmically it gets pretty interesting. Clustering is a very interesting field, especially once you get into the theory of it: it seems simple at face value, but it can get very complex and very theoretical, which is very interesting.

Okay, so let's talk about clustering. Clustering is exactly what it sounds like: the process of grouping data points into clusters. We're just grouping things. And again, humans are really good at this, because through evolution it seems that clustering things into groups has been beneficial to our survival. Is that group of leaves a tree? We're very good at looking at things and grouping them, to a fault sometimes; take some sociology and psychology courses, the in-group/out-group stuff. So with clustering, we're taking our data and asking: can we group these into categories? Now, to talk about this a little more specifically, we start trying to define what a group, what a cluster, is. The qualities we want are that within each cluster, intra-cluster, we want high similarity, by whichever metric we define similarity. And between clusters, inter-cluster, we want low similarity in general, again in whatever similarity space we define. Sure, there are five million different ways to define what similarity is, but in general this is what we're going for.

So here's one example. And by the way, when we talk about clustering we show examples in two dimensions, and it looks really easy because it's in two dimensions. Every time, try to think: okay, how would I do this in a thousand dimensions? We can't visualize that, so it's a little harder; the examples look very simple in two dimensions because they are. So we have this data. How many clusters would you say are in this data? Three, right? We're very good at this. That's the goal: look at this data and convert it into something that's interpretable. Now that we know there are three clusters, we can make some assumptions about the data: we can assume that all the data points in each cluster are similar, so the clustering acts almost like a summary for us. Previously, if I had to understand this data, I might have had to look at every single data point. But now, if I take one sample from each cluster, I can think, okay, I have a pretty good sampling of the dataset. Whereas if I had uniformly randomly sampled three points from this dataset, I might have sampled three points from the same cluster and I wouldn't have a good representation of the data. Now that we have the clustering, we can sample one point from each cluster and be fairly confident we have a good sampling of the dataset. So that's the intent of a good clustering.

Okay, let's formalize this a bit further. Given some unlabeled data (again, no labels; labels are banned for this particular lecture), the goal is to find hidden structure.
For example... well, I've harped on this enough. Another interpretation is as a generative model of the data with some probability. We're going to talk about probabilistic methods in later lectures, so bank that for now and we'll come back to it. But in a sense we're trying to generate a low-dimensional summary of the data; not dimensional in the x-and-y sense, but dimensional in the sense of belonging to a cluster. Before, let's say we had an x and a y, so it's two-dimensional data; now we have one dimension: is it in this cluster or not? Any questions up to now? That's the very basic definition of clustering.

All right, so why is clustering useful? As I mentioned, clustering itself serves as a summary of the data, and this is a really good example; I've seen it quite a few times. If I search Google Images for the word Pluto, this is the kind of spread of images that comes up, and you can use the methodology from this paper to group these images into clusters by row. The first row, I think, is some observatory; there's a Pluto walk. The second, I think, is a character design thing, and the third is the actual dwarf planet. Then there's some really old TV show called Pluto, then Pluto the dog, and then some artists. The point is, even within this one seemingly homogeneous category, Pluto, there are all these different clusters. So if we run a clustering algorithm like this on images, we can quickly see what kinds of subsets there are within the data, and we can assume the examples shown are representative of everything else that's similar within each cluster. That's pretty cool.

The next thing clustering is useful for is pre-processing before supervised training. Because we can interpret clustering as a kind of dimensionality reduction step, we can do it before any supervised learning, since it makes the problem simpler for the model we're using. For neural networks it doesn't make that much sense, because neural networks have such high capacity, but for certain tasks where we're going into tens of thousands or hundreds of thousands of dimensions, this might be something you have to do. It's also useful for labeling: it can help to pre-cluster your data before you start labeling. Instead of having the labeler go through item by item, this is a dog, this is a cat, this is a horse, we can show them an entire cluster and ask, what are these? And they can say, these are all horses. That's one labeling step instead of going data point by data point. So it can be useful for supervised learning down the line as well.

Okay, so let's talk about one method of clustering, called k-means. I'm sure some of you are familiar with it. This is probably the most naive way of doing clustering, which is why it's also commonly called naive k-means clustering. Given some dataset (and again, our dataset is conveniently clumped into three groups, the same example as before), what we're going to do first is take these data points and randomly assign them to k clusters. In this case we pick k equals three. Obviously, picking the correct k is a whole other subject in itself, right?
How many clusters should we cluster into? Let's assume some oracle told us that three is the right number to use here, okay? So we set k equals three. We randomly assign all the data points into three buckets, three categories; we just label them randomly. That's the first step. Now we have these points with random labels. Then for each category, for each cluster we've labeled, say all the red data points here, we take the mean (hence the name k-means) and assign that as the center of that cluster. And we do that for black, and for purple. So now we have three randomly labeled clusters, and we have the means of those random cluster assignments.

Next step: we look at each of those means, and then we look at the data points that are closest to each mean. Before, we assigned the data points to clusters and then calculated the means. Now we fix the means and reassign the data points to them. So you can see here: we had this before, we calculated our cluster centers, and then we reassigned all the data points to their nearest center. After we do that, we recalculate the means from the new assignments, and the means move a little bit. Then we go back and reassign the nearest points to their respective means, and we do that again, and the means shift again. So we recalculate the centers and reassign: we calculate, we assign, we calculate, we assign, until eventually, when you do this update, nothing changes. We've converged: the centers don't move anymore, and no points get reassigned, okay?

So, to formalize this objective (ooh, maybe the laser pointer will work; no, I don't know what I'm doing on the screen): we have this set here. S is the full set of all the data points, and we're seeking the parameters that minimize the distance between each data point and its assigned cluster center. We're using the L2 norm here; the two vertical bars with the two on top mean L2 distance, just the Euclidean distance between two points. We're trying to minimize the Euclidean distance of each point from its assigned cluster center. Lowercase c is the cluster center itself, because we're optimizing the locations of the centers, but we also have capital C, which is the membership of each point in each cluster. We're trying to find both the ideal cluster centers and the ideal memberships of points to those centers, so as to minimize the overall distance of each data point to its assigned cluster center. Does that make sense? Any questions about that first equation? We're just formalizing here; we're not actually going anywhere with this yet. The next equation below it: we can show that for any given cluster assignment, the cluster center that minimizes this loss, this Euclidean distance, is exactly the cluster mean. You can prove that, right?
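Written out (in my notation; the slide's symbols may differ slightly), the objective being described is:

```latex
\underset{C_1,\dots,C_k,\;\, c_1,\dots,c_k}{\arg\min}\;
\sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2
```

Here the capital C's are the cluster memberships and the lowercase c's are the cluster centers, and both are being optimized at once.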
And since the cluster center that minimizes this loss is always the mean of its cluster's members, since lowercase c is always determined by capital C, we can rewrite the objective in terms of variance. We have the cardinality of capital C, the number of data points in the cluster, times the variance, which is just the average squared difference between each data point and the cluster mean. We're taking advantage of the fact that the cluster center is always the mean of its cluster, and redefining the Euclidean-distance objective as a variance, because it's the same thing, okay? Does that also make sense? So the interpretation is that we're finding the cluster memberships that minimize the variance within each cluster. And if you're in a variance-and-distributions mindset, that might make a lot more sense than just minimizing individual distances. Okay, so we're trying to find the clustering that minimizes the variance per cluster.

Now let's formalize the algorithm we just demonstrated for iteratively finding these cluster memberships and cluster centers. We call this the EM algorithm, the expectation-maximization algorithm. This is a very common paradigm for unsupervised learning; you will see it everywhere. You'll see this kind of setup for different unsupervised problems very often, because this is how you converge: you don't really get a gradient here, so instead you iterate between two steps. First there's the expectation step: we estimate capital C given some current state, so we estimate the cluster memberships. Initially we did this randomly; we randomly assigned data points to clusters. Then the maximization step is where we estimate the cluster centers, the lowercase c's, and we do that by just taking the means. Those are the actual model parameters: the clustering model itself is defined by the locations of the centers, and we can derive the cluster memberships by finding which center each point is closest to. And we just iterate back and forth between the two. The maximization step is called that because this paradigm originally comes from probabilistic models, where what people are typically trying to do is maximize the log likelihood: if you have a probabilistic model, you're trying to maximize the likelihood of the data under that model. You don't have to worry about that for now; we'll cover probabilistic models later in the course. Just know that this EM setup is very common, and that we're stepping back and forth between estimating cluster memberships, updating the model parameters based on those memberships, and then updating the memberships based on the new parameters. These slides are really useful.

Okay, real quick: in the expectation step, we assign each point to the cluster whose center it's closest to; we assign the cluster memberships. Then, based on those memberships, we compute the parameters. So you have to assign the cluster memberships randomly at first? Yeah, that's how you initialize it. You initialize randomly at first? Exactly.
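To make the loop concrete, here's a minimal sketch of naive k-means, assuming numpy. It initializes with random assignments exactly as described above; scikit-learn's KMeans runs the same iteration but with a smarter initialization (k-means++):

```python
import numpy as np

def naive_kmeans(X, k, n_iters=100, seed=0):
    """Naive k-means via the EM-style loop: X is (n_points, n_dims)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))   # E-step init: random assignments
    for _ in range(n_iters):
        # M-step: each center is the mean of the points assigned to it
        # (if a cluster ends up empty, re-seed it at a random data point)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # E-step: reassign every point to its nearest center (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # converged: assignments stopped changing
            break
        labels = new_labels
    return centers, labels
```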
You could also randomly assign cluster centers instead, but I think that takes a lot longer to converge, just because you don't know what the plausible range for the centers really is. But theoretically, you could start by randomly placing the centers somewhere. Yes, okay. Yeah, how do you pick k? What happens a lot of the time is you just run the clustering with different values of k and pick the one that minimizes the variance globally; that's for naive k-means. You could also use something other than k-means, which might be a better solution if you don't know how many clusters you need, but sweeping over values of k is one strategy. The difficulty with that is complexity. This is fairly expensive, especially with a massive dataset, because you're computing a lot of distances. The cost is something like big O of n times k times d times i: four different factors multiplied together, your dimensionality, how many data points you have, how many clusters you have, and how many iterations you run. All of those multiply in the big O, so it gets pretty expensive pretty quickly. Good question.

Okay. This slide is more of the same thing; I've kind of front-loaded the explanation. Does it always converge to the same answer? No. Depending on your initial randomness, it can converge to different local minima and get stuck. Sometimes it's not even clear that there's a single global minimum, so it just arrives at different answers depending on the initialization. Again, what you do in practice is run it ten times and pick the most common answer, or pick the best one, the one that minimizes the global variance. Let me see if I missed anything on this interpretation slide. Okay, you get it, right? We're summarizing data via cluster membership, the membership tells you something about structure, and intuitively we're trying to minimize the variance within each cluster.

K-means is sometimes referred to as centroid-based clustering, because the mean of a collection of points is, geometrically, the centroid. I'll probably keep saying mean anyway, but it's also called centroid-based clustering. It defines clusters using a notion of centrality under some distance metric: L2, L1, cosine similarity; there are all these different metrics you can use to define what the center, what the average, means. And to solve it, to converge the clustering, we use the EM algorithm. There's also a probabilistic variant; you might be familiar with Gaussian mixture models. I don't know if we cover GMMs in this course, but they're very popular, so if we don't get to them, definitely look them up afterwards. And again, this is very useful when the centrality assumption is good: when similar things, by your definition of similarity, really do clump around a center, so that the center is a good representation of the cluster. But there are cases, and I'm getting ahead of myself here, where you can imagine a cluster shaped like this. Then k-means is not going to work, right? Because the center of this one is here, and the center of that one, to our eyes, is over here; but if you run k-means, you can see this cluster is going to get carved up like this, and that one is going to get clustered, I don't know, like this, right?
So you can see how this can break down in more complex scenarios where the centrality assumption falls apart. Any other questions about naive k-means? Cool. All right.

On this note, as a thought experiment: in this case, what is a good clustering? There are two options here: the clustering defined by the black circles, and the clustering defined by the orange circles. In a lot of cases like this there isn't one clear answer for what the correct clustering is, because it might be a task where we care about those individual small clusters, or a task where we only care about the big clusters overall. So instead of trying to answer that question up front, through the method we choose or by choosing k, we can do what's called hierarchical clustering. K-means used the centroid structure, where we said closeness, similarity, is what we're interested in. Sometimes we want something more like a linkage structure, or a graph; you'll hear that a lot in computer science, things get a lot easier if we just represent the data as a graph. One example of hierarchical clustering is agglomerative clustering, if I've pronounced that right; it's also called linkage-based clustering, because we're building links between data points. The idea is that if we have a hierarchy over our data, we can decide where to cut. Imagine a tree with all the data points at the bottom and some kind of binary tree structure going up to the top: if I cut it at the very top of the tree, I get two clusters, and if I cut nearer to the bottom, I get a lot of clusters. So this gives us a lot more flexibility in how many clusters we want, what the relationships between clusters are, which clusters are near each other. It adds all kinds of information that can be useful.

So let's go through an example of agglomerative clustering. We have some data points, and we start by drawing links, edges if you like the graph terminology, between the nearest points. We go through them; there's a criterion for choosing which pairs to link, usually a threshold, so we can say that if two points are nearer than five or whatever, we iteratively go through and draw edges between them. Then, for each remaining node, we take the nearest point, the nearest distances between neighboring points, and draw edges between those, and we keep doing that. For each data point we ask: what's the nearest point? If it's already in a cluster, that's okay, we just keep drawing edges. But note that we keep track of which edges we drew first: not only are the locations of the edges important, the order in which they were drawn is important too. We keep going until we've covered the entire dataset, until the edges connect all the subgraphs into a single connected component. Does this remind anyone of something else? Any similarities to existing algorithms or data structures?
Okay, so this is just finding the minimum spanning tree, if you've taken CS 38 or something like that; it's Kruskal's algorithm, right? We're building the minimum spanning tree of a graph, and we've employed Kruskal's algorithm to do it. But again, the key point here is that we don't just care about the graph structure, we also care about the order in which the edges were added, so that we can say this is a cluster that incorporates smaller clusters within it, et cetera. So again, depending on which level we stop at: stop here and we get two clusters; stop here, three; stop here, four; et cetera. That's where the usefulness of the hierarchy comes in. And, to point out again that the order matters: this is also equivalent to finding a binary tree partitioning of the space with progressively smaller partition distances, if that language is more familiar to you. Also worth noting that this is big O of E log E, where E is the number of edges; the algorithm, Kruskal's, I mean. So it's a lot more computationally efficient than naive k-means. There's still the cost of calculating all the pairwise distances, because you need to check what's nearest to what, but building the tree itself is E log E. Naive k-means, like I said, is big O of four different parameters multiplied together, so it's expensive. Finding the optimal k-means clustering is actually NP-hard, so you won't see the exact version being used on massive datasets or anything like that. There's some interesting NP-hardness theory here, actually. You know how, in NP-hardness analysis, you prove one problem is NP-hard by reducing another NP-hard problem to it? K-means clustering can be one of those. Look it up; you'll find some interesting CS theory down there.

All right, so that's all we have for clustering. To recap: we defined what unsupervised learning is, learning without labels. We covered k-means, which is centroid-based: you assume clusters are clumped together around some central point. And we covered linkage-based clustering, where clusters can be organized hierarchically, which gives us a lot of flexibility in how many clusters we get, and the hierarchy itself might tell us something about the data as well. And again, this works great when clusters are the structure you're looking for in your data. Any other questions about clustering in general? We're going to move on to another topic.
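If you want to try linkage-based clustering yourself, here's a small sketch using scipy on synthetic data (my own toy blobs, not the slide's dataset). Single linkage is essentially the minimum-spanning-tree construction just described, and cutting the resulting tree at different heights gives different numbers of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs standing in for the example data
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in ([0, 0], [3, 0], [0, 3])])

# Single linkage repeatedly joins the closest pair of clusters,
# which mirrors the edge-by-edge MST construction above.
Z = linkage(X, method="single")

# "Cutting" the hierarchy at different levels gives different clusterings.
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # cut high: 2 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")  # cut lower: 3 clusters
print(np.unique(labels_2), np.unique(labels_3))
```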
Well, since we have a little time, I'll also mention that there are other clustering methods based on other assumptions. K-means has a centroid assumption, where we say centrality is what's important. Linkage-based methods assume there's some hierarchical structure. And then there are methods that are density-based. For example, if you've heard of DBSCAN before: it's a very popular method for clustering the kind of data where there's no longer any sensible center. It's popular because DBSCAN doesn't have a centroid assumption; it has a density assumption. It assumes that points within a cluster are more densely packed together. I don't want to draw a lot of points, but DBSCAN does this thing where it starts at a point and keeps jumping to points that are near it, and the assumption is that if you keep doing that, you'll eventually reach the entire cluster through a chain, hopping from point to point. It's obviously more complex than that, but the intuition is that points in a cluster are near each other, so the chain stays within the structure and won't jump across the gap to another cluster, okay? So DBSCAN is a density-based clustering method. And again, I'm describing all of this in two dimensions because it's easy to illustrate; imagine it in a thousand-dimensional space.
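A quick sketch of DBSCAN in scikit-learn, on made-up data (two dense blobs plus scattered noise), just to show the knobs: eps is how far one of those "jumps" to a neighboring point is allowed to be, and min_samples is how many neighbors a point needs to count as part of a dense region:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.2, size=(50, 2)),   # dense cluster 1
    rng.normal([3, 3], 0.2, size=(50, 2)),   # dense cluster 2
    rng.uniform(-2, 5, size=(10, 2)),        # sparse background noise
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.unique(labels))   # cluster ids; -1 marks points treated as noise
```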
All right, cool. So let's talk about what clustering doesn't work on. What if you just have this one oblong blob of data? You can't cluster this; it's not going to tell you anything. It'll tell you you have one cluster, which isn't very useful. If you ask for two, it'll probably draw one on this side and one on that side, but that's wrong: there aren't two clusters here. There's not a substantial enough difference between the two halves to justify two clusters. So what we're going to talk about instead is called principal component analysis. Before I get into PCA, I want to emphasize that PCA is not a machine learning method. PCA predates machine learning by many decades; we've had it for a long while, and it's a core linear algebra concept. So please don't get the wrong idea. Machine learning has this tendency, and it's funny because Yisong mentioned this in his recording too, to eat everything and go: oh, you've existed for a hundred years? You're machine learning now, congratulations. Data science is now doing the same thing to machine learning, which is pretty funny, because machine learning has been doing it to other fields for a while. So just know that this is a really fundamental linear algebra method, not a machine learning method. Not that that makes it less important; if anything, it makes it more important.

Okay, anyways. So instead of clustering, instead of assigning membership to clusters, instead of grouping and categorizing, maybe we want to describe the data in a way that maximizes our understanding of it. If you know PCA already, you can feel where I'm trying to lead. When we summarize data, we're generally trying to understand it by looking at fewer attributes or fewer dimensions. In clustering, we did this by replacing each data point with a membership: instead of saying you're x, y, z, i, j, k, we say you're a member of this cluster. That's how clustering does summarization. PCA instead does summarization via orthogonal projections, by defining a new feature space through which we can interpret the data. Sometimes that means dimensionality reduction, and you can certainly use it for dimensionality reduction, but even PCA by itself defines a new feature space in which the data becomes more interpretable. We'll go through some examples, okay? For this particular example, what we have is this blob, and let's say, since the axes aren't drawn here, that the blob lives in x-y space.

When we look at it in x-y space: because the blob is tilted diagonally, the variance in x only extends from here to here, and the variance in y only extends from down here to there. And when I say variance, you can think of it almost as contrast; I just mean the sheer range of values that the data covers, okay? What we can do instead is find new axes, a transformed space, onto which we can project the data so that it has even more variance, more contrast, more expression along each individual axis. You can imagine it on this plot: before, when the axes were just up and across, the x-axis only covered this much variance. When we redefine the x-axis along this tilted direction, drawing it like this, you can imagine it got longer; the data covers a wider range and is more spread out along that axis. For this example it's hard to see why that matters, because it's only two dimensions, but there are settings where it really does matter. We want each axis (axes is the plural) to contain as much variance, and therefore as much information, as possible. And it becomes even more important once we get to dimensionality reduction. Anyway, we can achieve this new feature representation by drawing these lines and rotating them to become our new x and y axes. That's PCA.

Okay, let's get into the linear algebra part of this. If you know PCA already from a linear algebra course, you probably know it as the eigendecomposition of the covariance matrix. We're going in reverse: we'll start from what we want and work back to the eigendecomposition of the covariance matrix, which is kind of interesting, because I always learned it from the eigendecomposition side. This is a more machine learning-oriented presentation, so I think you'll find it interesting; and if you don't know what I'm talking about, don't worry, we'll get there.

First, let's define what an orthogonal matrix is. A matrix U is orthogonal if U times U transpose, which equals U transpose times U, is the identity matrix. By identity I mean a square matrix with ones all along the diagonal and zeros everywhere else. Orthogonal matrices have the property that, for any column (I was trying to do this without drawing the matrix, but here's my diagonal matrix anyway): since U transpose times U equals the identity, by the definition of matrix multiplication, if we take any column of U and multiply it by its own transpose, we get one, because that's what sits on the diagonal. And for any two different columns u and u prime, their product is zero, because that's what fills the rest of the spots in the identity matrix.
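In symbols (my notation), the properties just described are:

```latex
U^\top U \;=\; U U^\top \;=\; I,
\qquad
u_i^\top u_i = 1 \ \text{for each column } u_i,
\qquad
u_i^\top u_j = 0 \ \text{for } i \neq j
```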
And so the interpretation is that PCA is trying to find some U, because we can treat U as a rotation matrix: U transpose is the matrix that transforms the data into the new feature space, and U is the inverse rotation that brings it back into the original data space, okay? That's what we're trying to produce with PCA. So if x is a data point and we multiply it by U transpose, we arrive in the new feature space; multiply that by U, and we end up back where we started. All right? That's what we're trying to arrive at.

Other properties of orthogonal matrices that are important: first, they're norm-preserving. What that means is that the projection of a data point into the feature space keeps the norm; it doesn't change the magnitudes, it doesn't change the distances between points and things like that. Mathematically, the inner product of the projected data point with itself is the same as the inner product of the original data point with itself. That's important for any pre-processing transformation we want to apply to the dataset: we don't want to change the data itself, okay? Second, and just as important, it preserves the total variance. So when we draw these new axes, we're trying to maximize the variance expressed along each axis, but we're not changing the overall variance of the dataset. The point is that this doesn't change qualities of the dataset itself; it just re-projects the data into a space where it's easier to work with, where the axes are more expressive. This does assume a zero mean, so we subtract the mean from the dataset before doing this. For example, if the data sat over here, we would subtract the mean so that (0, 0) is in the middle before we do anything. All right, any questions about PCA so far? Again, we're going a little bit backwards: we're starting from what we want, which is to define some matrix U so that all of these qualities are satisfied.

So one of the motivations for PCA is dimensionality reduction. One way to reduce your dimensions is simply to drop an axis. If I have x and y, I could say my model can't handle two dimensions, it can only handle one, so I'm just going to use the x-axis. That's a decision I could make. And if I'm going to do that, it's in my interest to have as much variance as possible covered by the axis I keep, which is where this becomes useful. The concept is that we can project all the data points onto a single axis to get a lower-dimensional representation of the dataset, and PCA helps me find the axis on which it's best to describe the dataset with just one axis. It helps me draw the best line, essentially. Okay. So again, we have this data, and we draw the new axes with PCA. Now take one column of U: we have the full U matrix that we calculated, and if we take just one column and multiply it with the input data, we get this projection, because of how the projection works.
If we multiplied by the entire U matrix, we'd end up with the full re-projection; but because we're only multiplying by a single column of U, we're projecting the data onto the space defined by that one column, and that's where we end up. Then we re-project it back. Remember how we said U transpose gets you into the new space and U brings you back? If we try to go back here, we actually end up on this line, because in the process we've lost the second dimension entirely. We've thrown away all the information contained in the other axis by projecting the data into the new space and then projecting it back into the original space. The shape is preserved along the direction of the axis we kept, but we can never recover the other axis: we project down, and then we project back up.

We can do this with any arbitrary subset of columns. If we start with 300 dimensions and use only three columns, we can project a 300-dimensional data point into a three-dimensional one, or a two-dimensional one, and then we can try to project it back, but we can't fully, because we've lost the other dimensions. We can start with 100-dimensional data, project it into two-dimensional space using two columns of the U matrix, and then plot it; this is very useful for visualization. But if we project it back, the information that lived in the dropped columns is gone. Does that make sense? Any questions about this process? Confusion or clarity? I can't really tell which, but anyway, we'll move on.

Okay, so here's a formal definition of what PCA is accomplishing. We have a matrix of all the data (the slide says M, but it should be X, capital X), and we subtract the mean, because PCA operates on data whose mean is zero; just take my word for that one. Then PCA decomposes X times X transpose into U, which is an orthogonal matrix, times a diagonal matrix, capital lambda, times U transpose. If you've taken linear algebra, you're screaming: that's the covariance matrix, and this is just an eigendecomposition of the covariance matrix. We'll explain what that means for those who haven't taken linear algebra. But that's the formalism: from the data, from the mean-centered data, we can compute this U, which gives us the best axes to draw, the best space to project our data onto in order to maximize the variance along each axis.

All right, let's keep going. We have this decomposition, and it turns out each column of U is an eigenvector of X times X transpose, and each lambda in that diagonal matrix, the center term, capital lambda, is an eigenvalue. If you don't know what an eigenvalue or an eigenvector is, this won't make much sense; I can't really teach you eigenvalues and eigenvectors right now, so do go back over your linear algebra. But if you do know what these are, the point is: each column is an eigenvector and each lambda is an eigenvalue. And in fact, for PCA specifically (this shouldn't be a minus sign, it should be a bullet point; PowerPoint formatting strikes again), PCA sorts the eigenvalues in order of magnitude.
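To make that concrete, here's a minimal numpy sketch. Note I'm using the samples-in-rows convention, so the matrix being decomposed is X transpose X (up to scaling); the slide's X X transpose is the same idea with the data stored the other way around:

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition. X: (n_samples, n_features).

    Returns the mean, the orthogonal matrix U (eigenvectors as columns,
    sorted by decreasing eigenvalue) and the sorted eigenvalues."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # mean-center first
    cov = Xc.T @ Xc / len(Xc)            # covariance matrix of the features
    eigvals, U = np.linalg.eigh(cov)     # eigh: symmetric matrix, real eigenpairs
    order = np.argsort(eigvals)[::-1]    # sort directions by decreasing variance
    return mu, U[:, order], eigvals[order]

# Projection onto the top k components, and the (lossy) back-projection:
#   Z     = (X - mu) @ U[:, :k]          # k-dimensional representation
#   X_hat = Z @ U[:, :k].T + mu          # back in the original space, detail beyond k lost
```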
So the first column of the U matrix is the eigenvector that corresponds to the largest eigenvalue in that diagonal matrix. That sorting is important. The interpretation: X times X transpose measures how each variable varies with respect to the others, because we're multiplying the data by its own transpose; we're measuring how each term co-varies with the rest. That's the general idea. So we take this covariance matrix and compute its eigendecomposition to get the PCA solution. The first column of U is then the direction of greatest variation (it always is, because it corresponds to the largest eigenvalue), the second column is the direction of second-greatest variation, because it's the second-largest eigenvalue, et cetera. And lambda one is the total variation along u one, where u one is the eigenvector and lambda one is its eigenvalue.

Let me say this in three bullet points, and this is all you really need, especially if you haven't had linear algebra. The covariance matrix measures how each variable is associated with the others, because we multiply the data by itself. The directions of the spread of the data are the eigenvectors. And the relative importance of each direction is in the eigenvalues. As long as you know that, you know what you need about PCA and its eigenvectors. Obviously you can go into the linear algebra and learn it properly if you're not already familiar; I know we have a lot of different backgrounds in this class.

Okay, so we keep talking about this first column, and this first column defines the direction of highest variance. We can also recast this as minimizing the squared loss of the reconstruction. We talked about how we can take the original data, squeeze it down using the first column of U, and then project it back using the transpose of that same vector. If we phrase this as an optimization, which is interesting, we're trying to find the direction that minimizes the loss after going through this process, the loss between the original data and the projected-then-inverted data. We're trying to lose the least amount of information after projecting and coming back. I'm not going to walk through the proof, but it's on the slide if you want to look at it later: the first column of U from PCA is exactly the direction that minimizes the loss after re-projection. Continuing this line of thought, we're trying to find the direction that minimizes the residual squared norm, the same thing I said before: the axis such that if I project onto it and bring the data back, the loss is minimal. And then the same goes for the next direction, and the next. Also worth mentioning that all of these vectors are orthogonal to each other. Is that true? I think that's true; that sounds right, yeah.
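In symbols (my notation, assuming the data points x_i are mean-centered), the claim about the first column is:

```latex
u_1 \;=\; \underset{\lVert u \rVert = 1}{\arg\min} \;\sum_{i} \big\lVert x_i - u\,u^\top x_i \big\rVert^2
      \;=\; \underset{\lVert u \rVert = 1}{\arg\max} \;\sum_{i} \big(u^\top x_i\big)^2
```

That is, minimizing the reconstruction error over a single direction is the same as maximizing the variance captured along that direction, which is why the two descriptions of PCA agree.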
We're now phrasing PCA as an EM algorithm just to show that it can be done. You wouldn't really do this in practice, because it would just take more time. But if you wanted to compute PCA with an EM algorithm, you'd alternate: we find the axis directions that minimize the information loss from projecting onto them, then we update the residual from that, and then we go back and do it again. The point is that we can frame PCA as an EM algorithm if we want to, just to show that we can. Because that's computer science for you. All right, okay. I mentioned this already, I kind of got ahead of myself, but again, a key property of PCA: the first K columns of U are guaranteed to span the K-dimensional subspace that captures the most variability of X. That just means that if I take the first K columns of U, the subspace defined by those columns is guaranteed to be the best K-dimensional subspace for capturing variance. There is no better K-dimensional subspace than the one found by PCA; we're guaranteed that PCA's is the best. And we essentially proved the K equals one case earlier, when we framed PCA as an optimization task; you can extend that argument for K equals two and beyond.

So to explicitly define the dimensionality reduction we do with PCA: you solve PCA, you find U, and nowadays you can literally just call scikit-learn's PCA. Then we use the first K columns to define a K-dimensional representation, and that gives us a summary of the original dataset. A lot of times, when we have 100-dimensional or 200-dimensional datasets, this is one of the first things we do: get them down to two or three dimensions and plot them, just a scatter plot of the two- or three-dimensional PCA reduction (see the short sketch below). Sometimes there's very obvious clustering, in which case you lucked out. In other cases it's a total blob, and now we actually have to work, right? But if you can PCA it down to two dimensions and there are clear clusters, then you're basically done; just run clustering on that. Okay, so here's an example. This is a really good example, I think, because it visualizes this fairly abstract concept of projecting into a lower-dimensional space. This is called eigenfaces. What we do is we take a corpus of faces, a list of images of faces, and we treat each pixel as a feature. Then we run PCA, and we can actually visualize the eigenvectors that we get. So this is an eigenface. The idea is that using some linear combination of these eigenfaces, we can eventually build every face that was in the dataset. I think this slide is an example of starting from the mean and then slowly reconstructing up to someone's face. Again, we take someone's face and project it into K dimensions, say 10 or 15; it's originally something like 300,000 dimensions, because it's an image. We project it down into fewer dimensions, then we project it back into image space, and this is roughly what we end up with. So Yisam did this with previous years of CS155, using your images from the directory. This is the average face that Yisam calculated from four or five years of this course, and these are the eigenfaces you can get from that.
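Here is a quick sketch of that reduce-to-2-D-and-plot workflow using scikit-learn's PCA. The dataset is a stand-in (sklearn's built-in digits, not anything from this course), just to show the shape of the code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits      # stand-in dataset, not course data
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)           # 64-dimensional points (8x8 images)
pca = PCA(n_components=2)                      # keep only the first two directions
Z = pca.fit_transform(X)                       # project every point down to 2-D

# Scatter plot of the 2-D projection; if clear clusters show up, you lucked out.
plt.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
print("variance explained:", pca.explained_variance_ratio_)
```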
So again, the colors here don't really mean anything, because once you're in the eigenspace the values can be negative or positive; this is just a color-map representation of these faces. But the concept is: with this average face, because we need to mean-subtract the data, and with each of these eigenfaces, if you have enough eigenfaces, we can take some linear combination and reconstruct every face that was in the original dataset. Yeah, good question. So you have as many eigenvectors as you have features, but we're only keeping the top few. This is the top eigenface at the top there, the most important direction, and so on and so forth; these are the top 12 eigenfaces. And then from there you can do this. This is a previous student who has graduated, so we're allowed to use their image. If we project this face into the eigenspace and reconstruct it using just five eigenfaces, that's what it looks like. Then we use 10, 15, 20, 30. The more eigenfaces we use, the more of the variance in the original data we keep. That's the concept here; that's the amount of information we keep. We take the roughly 300,000-dimensional data and compress it down to five dimensions, which is kind of impressive if you think about it. Five dimensions and that's what you get. Or maybe 30 dimensions compared to 300,000: just 30 eigenfaces linearly combined gives you that, compared to the original input. Kind of impressive.

All right. So I did this for this year. I did this last night, way too late. All right, let me pull it up; we don't have that much time, so maybe we have time for one or two examples. I don't know how to get rid of that. I took all of your directory photos from the website, from the registrar, because I have that power. And I did do one thing that's important, which is I normalized all the face locations. When you do this, it's kind of important that the face is in the same place, because if you take the average with the faces all over the place, it's not gonna come out right. So I took this; this is running on my laptop, so it'll take a second. It shouldn't take that long to import. I hate live demos. All right, load the data. All right, so this is the face normalization. I used a Haar cascade. Someone asked me about face detection without deep learning at some point; this is how you do it. It's a cascade classifier, a Haar cascade, where someone has gone in and defined what a face looks like already. So this is the mean face. Whoops, not that one. Okay, there is the mean face. It looks more like a face because I normalized all the face locations, so I'm pretty happy about that. This is for this year only, just for 2023. Okay, and then if we do this, these are the top 12 eigenfaces from that dataset, from the 189 images that I had. You'll notice that some of this is handling the variance of the background: some photos have a blue background, and so the eigenfaces need to cover that variance as well. You'll see that. Okay. Does anyone want to volunteer their face? I'm serious. Yeah, okay. Fine, image four. So what this is doing is using different numbers of components to reconstruct the profile picture. Oh, that's the wrong one. I'm asking for it. Anyways.
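For reference, the pipeline I just ran looks roughly like the sketch below. This is not the actual demo code: it assumes a list called `images` of grayscale photos loaded elsewhere (the real demo used color images), but it shows the two steps, Haar-cascade face cropping for normalization, then PCA over the flattened pixels to get the mean face and the top eigenfaces.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

# Step 1: Haar cascade face detection, so every face ends up in the same place.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(img_gray, size=64):
    """Detect the first face in a grayscale image and crop/resize it."""
    faces = cascade.detectMultiScale(img_gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return cv2.resize(img_gray[y:y + h, x:x + w], (size, size))

# Step 2: stack the cropped faces as rows (one pixel per feature) and run PCA.
# `images` is assumed to be a list of grayscale photos loaded elsewhere.
crops = [crop_face(img) for img in images]
F = np.stack([c.ravel() for c in crops if c is not None]).astype(float)

pca = PCA(n_components=12).fit(F)                   # mean face lives in pca.mean_
eigenfaces = pca.components_.reshape(-1, 64, 64)    # top 12 eigenfaces as images
```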
All right, well, this is random, I guess. Oh no, picture three, picture four... all right. Anyways, this is someone random, but I'll block it out in the recording. So you can see that using just one eigenface, right, and remember the mean image is kind of a generic, androgynous face, even with one eigenface we already get kind of a gender representation, or at least a hair representation. And as you use more eigenfaces, it gets closer and closer to a full reconstruction, until you end up at the original input, which is on the right here. I don't know why that didn't work. I spent so much time making sure the columns were right. I don't even see that image on my roster. Oh, some were skipped, all right. Anyways, there are bugs. Always, with live demos. Anyway, so this is the concept. We're doing this randomly anyway, so now I'm just gonna pick random numbers. It's not gonna be in the recording; I'm gonna block out the screen in the recording, and then everyone will be sorry they missed it. Another one, right? So we have the mean face. This one's kind of interesting because the face position is a little different; you can almost see it morphing as we do more and more of the linear combination. Was this interesting? Was this worth doing? All right, cool.

Yeah, yeah, yeah. So literally, you can treat each of these faces as eigenvectors, and then you multiply by them. Those become the columns, so you can think of it as: when you multiply the input by those columns, we're projecting that input image into the space defined by those columns, and then we're projecting it back into the input space. So it's a combination of these eigenfaces; conceptually you can think of it as stacking them on top of each other with some weights, and they add up to the reconstruction. That's not exactly how I'd describe the math, but each of these faces defines some axis in the roughly 300,000-dimensional space of the pixel values. No, these aren't grayscale, this is RGB, but this is an eigenvector, so the RGB values don't mean anything anymore. I did have to scale this, like, min-max, so the colors aren't real here; it's pseudo-color, yeah. Can you use this to create a face that doesn't exist in the dataset? If you could define a face in the latent space, in the space that these eigenfaces define, if you could go in and put in numbers, yeah, it would give you something back. I don't know if it would give back something good, because these eigenvectors don't have any semantic meaning; it's literally just looking at RGB values. So it'd be hard. Theoretically it's possible, but it'd be hard with this representation. There are other methods, like deep learning methods, where the embedding is more semantically meaningful, and we'll talk about this, where each axis defines something like nose shape or skin color or face structure. In that sense, you could shift those numbers around and it would actually generate something meaningful. But this particular PCA representation is locked to the individual pixel values. Yeah, that's fine.
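Here is a small sketch of that reconstruction step, reusing the hypothetical `pca` and `F` from the previous snippet: project a flattened face onto the first k eigenfaces, project back, and watch the error shrink as k grows.

```python
import numpy as np

# Reconstruction sketch, reusing the hypothetical `pca` and `F` from above.
def reconstruct(face_vec, pca, k):
    """Project a flattened face onto the first k eigenfaces, then project back."""
    centered = face_vec - pca.mean_
    coords = pca.components_[:k] @ centered            # k numbers describe the face
    return pca.mean_ + pca.components_[:k].T @ coords  # back into pixel space

face = F[0]
for k in (1, 5, 10, 12):
    approx = reconstruct(face, pca, k)
    err = np.linalg.norm(face - approx)
    print(f"{k:>2} eigenfaces -> reconstruction error {err:.1f}")
```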
I would upload this, but again, it's privacy stuff, so I'm not gonna do it. But if you wanna come up afterwards and play around with it and try your own, that's fine. Okay, so we have one last concept to cover, which is singular value decomposition. Again, if you've taken linear algebra, you're aware of this: instead of taking the eigendecomposition of the covariance matrix, we operate on the input matrix X directly and decompose it into an orthogonal matrix U, a diagonal matrix, and another orthogonal matrix V transpose. If you haven't run into this in linear algebra before, don't worry about it; you'll want to learn it when you catch up on linear algebra. I'm just trying to draw a comparison between PCA and SVD here, because they're very closely related. In fact, SVD and PCA are equivalent, because PCA is just working on the covariance matrix; if you expand the covariance matrix using the SVD of X, you end up with the same thing. And then here, Yisam goes over how he did eigenfaces using SVD. I did eigenfaces using PCA, so this isn't quite the same, but if you were to do it with SVD, you would do the decomposition into U sigma V, and then he did this thing where he took the square root of the diagonal matrix and folded it into U and V. I don't know if that's legit, but it seems to work; I don't know the theoretical basis behind it because I didn't do it that way. But this is how he did eigenfaces, so just know that you can do a similar process with SVD as well as with PCA. Oh, I thought he was doing some kind of interpolation there, but I don't think so.

Okay, so again, like I mentioned, and this was a good question by the way, one limitation of eigenfaces is that each dimension is a pixel, which doesn't mean anything. Semantically, we would prefer our embedding to be semantically meaningful; segue for a future lecture. But if a dimension had more meaning, we'd have a much clearer visualization. All right, in summary: clustering and PCA are two unsupervised methods that reduce the dimensionality of the data representation in one way or another. Clustering does this by defining the new dimension as cluster membership, whereas PCA and SVD actually find the best axes that explain the most variance in the data, and then we can throw away some of those axes and trust that the remaining axes characterize as much as possible of the variance of the original dataset using those fewer dimensions. So you get the coefficients, yeah, and a k-dimensional projection. This allows us, down the line, to do nice visualizations and nice interpretations. And again, clustering and PCA are some of the first things we do in data science and machine learning, where we need to understand our data before we even start labeling it, because maybe labels aren't required, or maybe there are structures, maybe there are anomalies. Dimensionality reduction is really good for anomaly detection. In fact, there's an anomaly detection method where we fit PCA on normal data, and then if abnormal data comes in, the reconstruction is really bad, because the PCA basis hasn't captured the variation in the anomalous data. So you can do anomaly detection with this. My previous project, called DEMOD, used this concept to do anomaly detection and content discovery in large datasets.
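To tie SVD and PCA together concretely, here is a small numpy/scikit-learn sketch on made-up data: the right singular vectors of the centered data match the PCA components up to sign, the squared singular values match the covariance eigenvalues, and at the end there is a crude version of the reconstruction-error anomaly score I mentioned.

```python
import numpy as np
from sklearn.decomposition import PCA

# SVD vs. PCA on made-up data: same directions, and squared singular values
# equal the covariance eigenvalues.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # samples as rows
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)          # Xc = U S V^T

pca = PCA(n_components=5).fit(X)                           # sklearn centers internally
print(np.allclose(np.abs(pca.components_), np.abs(Vt)))           # same axes, up to sign
print(np.allclose(pca.explained_variance_ * (len(X) - 1), S**2))  # eigenvalues = S^2

# Bonus: PCA reconstruction error as a crude anomaly score.
pca2 = PCA(n_components=2).fit(X)
X_rec = pca2.inverse_transform(pca2.transform(X))
scores = np.linalg.norm(X - X_rec, axis=1)   # larger score = less well explained
print("most anomalous-looking point:", int(np.argmax(scores)))
```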
This is a huge field, right? It's not just this; there are a lot of additional things you can do to make these methods actually useful. And again, it helps us generate nice visualizations and interpretations for that initial step. You'll often hear this called exploratory data analysis, where given some dataset, we're just poking at it to see what's in it: what are the correlations, what are the clusters, what are the dimensions. Okay, so next lecture is on Thursday, going beyond interpretability and explainability, again by Dr. Lucas-Mandrick. I know it's midterms, but if you come and just work on your midterm in class, I don't care, and neither does Dr. Lucas-Mandrick. He's a pretty good speaker, and it's kind of a shame to only see him through a screen. So if you have a chance to come, or maybe take a break from midterms or assignments, I know you're totally stressed out with way too much work all the time, I get it, but if you can come and listen to Dr. Lucas-Mandrick, I would really appreciate it. And if there's time left over, because I don't know how long his presentation will be, I'll just do office hours afterwards anyway. I do have office hours after class today for an hour; we can walk over. Same thing on Thursday, and then Friday morning, okay? All right, thanks everyone.