So I'm going to talk about dimensionality reduction, and I invite everyone to ask questions as they come along. In this lecture, one thing will build on another, and we'll do some tutorials in between. But it's important that you follow each step, because each step builds on what was done previously, okay?

So what's dimensionality reduction? Can I have a show of hands: who of you would consider themselves more experimentalists, people that acquire data? And who's more on the theoretical side? Okay, so it's a bit of a mixture, and some are both; that's good. Nowadays, when you collect data, you often collect gigabytes of data, and then there's the problem of what to do with it. So I have some examples. One example could, for instance, be recording EEG from a human over many hours; you have many different channels, and then afterwards you try to understand what was happening. Another example, and that one is particularly terrifying, is calcium imaging in an animal such as the zebrafish. The zebrafish is transparent, so people can now basically image the whole brain, that's, say, something between 100,000 and a million neurons, and you can image that over eight hours. So if someone gives you the data, and a colleague of mine does this type of imaging, it comes on a hard disk, and it's on the order of terabytes. I personally already gave up on that, because I couldn't load it into my computer, but some of you may have to deal with that. Another example, which I'm a lot more familiar with, is large-scale electrophysiology, where you record hundreds or thousands of neurons, either simultaneously or sequentially over many days, and then you want to analyze that. And that is one of the things we're actually going to do today and tomorrow: we're going to analyze some electrophysiological data, but the techniques you're going to learn you can apply to pretty much any dataset.

Now, especially if you're more from the experimental world and you just came out of your master's or so, your thinking may be a little bit like the following. You get this data, and then you're looking for that software package that you can push your data through to get out a result. Unfortunately, the state of the field is not quite at the level where we can just tell you, here's the way you're supposed to analyze your data, because we don't know how to analyze your data. And that means, unfortunately, that you have to learn yourself how to analyze your data and how to extract things. So this particular approach usually does not work; you need to understand what you're doing. Now, one way in which you could handle the problem, and I think many people do, and it's kind of where I get my job security from, is to replace that software package by a theorist, someone that helps you analyze your data. But that's also not really what you want as an experimentalist. You don't just want to outsource the problem to someone else that then tells you what to do with your data. What you really want, and that is the goal of this lecture, is to be able to understand yourself what to do with the data. Then you can still collaborate with people, et cetera, but I think it's important to gain an understanding of what the methods that you're going to use actually do with your data.
Both maybe because you want to understand things, but also to be on the safe side of not making what in hindsight may seem like silly mistakes.

So if we have population data, data from many channels, many neurons, or something like that, it'll usually come in the format of a matrix. So here, just as an example, you could imagine that you have all these different channels; in our case, the channels would be different neurons, say n channels. And then you collect data over time, so each row here is, in some sense, a time series of one of these channels. So x11 is basically the value of the first channel at time point number one, x12 is the first channel at time point number two, and so on. So you can imagine that your data comes in this big matrix, and that would basically be true for the examples I've just shown you. The problem with the matrix is that it's really big; you may not even be able to plot the data.

Now, if you have that type of data, then the methods that you may apply to it generally fall under the term of unsupervised learning. Because what you may want to do is just find structure in those channels: find, for instance, points in time where all the channels co-vary, or something like that. So some type of structure in the channels that you just want to extract without knowing anything else. That's what we call unsupervised learning, and examples of that are, for instance, principal component analysis, clustering, and many more.

Now, in many experiments in neuroscience, you won't just have these channels that you record, but you'll also have some extra parameters. Say, for instance, that in the case of the EEG recording you have a subject that also does some task, and so you're measuring parameters of that task. And the same could be true for the zebrafish, which may be seeing certain images as visual stimuli. So there are these extra parameters or variables that you have. You can imagine that you have one matrix which captures what you're recording, and another matrix which captures all these labels that tell you what's happening at different points in time in the environment, or what the animal or subject is doing at different points in time. So there will be a second set of variables y. And then there are many types of questions you could ask, but a common one could be: find structure in the channels x that somehow allows you to predict y. Obviously, you could also turn it around and say: find some structure in these extra parameters y, something about the stimuli or the behavior, that explains structure in x. But in either case, you want to take one and explain the other. The general type of methods that you would apply here are called supervised. Examples of supervised methods are, for instance, regression, such as linear regression, clustering if it's supervised, classification in that case, et cetera.

And the goal of dimensionality reduction is the following. Given that some of these data sets are very big, and we're mostly focused on this matrix x, what you'd ideally want to do is not just use that huge matrix x, but first extract structure in x so that you can make it low dimensional, so that you can cook it down, compress it in a meaningful sense, and then apply your unsupervised method or your supervised method, et cetera.
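Just to make the shapes concrete, here is a small sketch in Python of how such a data set might be laid out; the sizes and variable names are made up for illustration, not taken from any real recording.

```python
import numpy as np

# A hypothetical population recording: n channels (e.g. neurons) by T time points.
# X[i, t] is the activity of channel i at time point t -- the x11, x12, ... above.
n_channels, n_timepoints = 100, 5000
X = np.random.randn(n_channels, n_timepoints)     # stand-in for real data

# Optional task labels: one value per time point, e.g. which stimulus was shown.
y = np.random.randint(0, 4, size=n_timepoints)    # stand-in labels

# Unsupervised question: find structure in X alone.
# Supervised question: find structure in X that predicts y (regression, classification).
```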
And so the question is, why would you want to do that? Why would you want to find the structure in x in either of these cases? I'd say there are several reasons. A very simple one is that it's a way of compressing the data, and that saves time and storage on the computer. If you nowadays have a movie or a JPEG image, those are compressed versions of the original, so that's a way in which you save space. Another reason, and that's the one that will interest us more, is to be able to even just look at what is going on in your data. And I'll show you why that is a problem with the example that we're going to go through: what is even in the data if you have so many channels? How do you look at it? Looking at it is very important to find something interesting. And then, of course, as I mentioned, you can also use the reduced data as input into other methods. Those are the general goals of dimensionality reduction.

Before we get to dimensionality reduction, though, I want to go through a simple exercise that will maybe also make clear why this is a useful topic to know something about. We're going to look at a particular example, look at the spike trains that were recorded in this example, and extract them in the classical way, by computing peristimulus time histograms, et cetera, over many, many neurons, and then have you wonder: what do you do with that? That'll be the exercise until the break, roughly. I call it sorting and averaging.

The example that we're going to work on is the following. It's a classical working memory task, more than 20 years old by now, done by Ranulfo Romo, and it works as follows. You have a monkey that receives a vibratory stimulus on its fingertip. It's a little buzz, basically; if you do it to yourself, it's a little vibration. That vibration has a particular frequency and lasts for five hundred milliseconds. Then there's a delay; the delay is generally three seconds long. And then there's a second stimulus, and the second stimulus again has a particular frequency. Now, the task of the monkey is to determine whether the first stimulus frequency was larger than the second stimulus frequency. The monkey basically has two buttons, one if it was larger, one if it was smaller, and it has to press one of the two buttons. If it gets it right, it gets a little juice reward afterwards; that's the motivation for the monkey to actually participate in this task. Then there are various combinations of stimuli that you could show. In the experiment that we are going to look at, these are the stimuli that were shown: there's the base stimulus, the first stimulus, that comes at different frequencies, and then there's the comparison stimulus that generally just comes at two different frequencies. The frequencies are on the x and y axes; each of these little boxes is one particular combination that the monkey would receive, and the number in the box tells you the percentage of trials in which the monkey got it correct. In the exercise we'll ignore correct versus incorrect for simplicity.
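To keep the trial logic in mind, here is a tiny sketch of it in Python; the constants and the function name are mine, not taken from the actual data files.

```python
# Minimal sketch of one trial of the task (names and values are illustrative).
STIM_DURATION_S = 0.5   # each vibratory stimulus lasts 500 ms
DELAY_S = 3.0           # the delay between the two stimuli is about 3 s

def correct_choice(f1_hz, f2_hz):
    """True if the correct answer is 'the first frequency was larger than the second'."""
    return f1_hz > f2_hz

# e.g. correct_choice(26, 18) -> True, correct_choice(10, 18) -> False
```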
And so what Romo did is he recorded in many areas of the brain while the monkey is doing this task, so it gives you a view of what's going on in the monkey brain during this particular task. We'll focus on an area called the prefrontal cortex, a frontal area that's generally believed to be responsible for working memory. Here you have an example of one neuron recorded in the PFC. On the x axis is time; on the y axis are different combinations of stimuli. So what's shown in this case is, first, trials where the first stimulus frequency was 18 hertz, then trials where it was 22 hertz, and then trials where it was either 10 or 26 hertz, split depending on the decision of the monkey, though I'm not entirely sure I'm explaining this correctly. And for each of these blocks you see several trials, and the spike train of the neuron in these trials. So for each of the trials where, say, the first stimulus frequency was 18 hertz, you see all the spike trains: 10 different trials, 10 different spike trains of this neuron. That's a raster plot, a common way of plotting the spike trains of neurons.

Now what we can do is turn this type of raster plot into what's called a PSTH, or peristimulus time histogram. For each of these blocks, you average all of the spike trains and then you smooth the result with a convolution, for instance with a Gaussian filter, and that gives you an estimate of the firing rate, the time-varying firing rate of the neuron. If you do that for this particular neuron, this is what you get. So here's time in seconds, this is the first stimulus, this is the second stimulus, and here I color coded the first stimulus frequencies, so that red basically means F1 was 10 hertz and blue means F1 was 34 hertz; and in the second period I color coded according to the decision, whether the monkey decided yes or no. And you see that this neuron's firing ramps up during the delay period, and the colors segregate according to the first stimulus frequency: the neuron fires more if the first stimulus frequency was lower, and less if it was higher. Generally you could say that is a correlate of the short-term memory of the monkey: since the monkey has to remember the first stimulus frequency during this delay period, what you see here is something that correlates with that short-term memory, because there's information about the first stimulus frequency F1 in the firing of this particular cell.

Now, the problem in the prefrontal cortex is that not all cells fire like this. If they all did, that would be great, but that's not the case. Rather, each cell does something different. So here I'm showing you nine cells, and to be honest, Romo has recorded on the order of thousands of cells. If we just look at these nine cells, you see that each cell does something different; they all have somewhat different dynamics. Some cells, like this one, are very nice, because you see the spreading of the colors during the delay period, so there's some type of short-term memory. And then you see that the decision here separates into two different firing rates, so there's a nice correlate of the decision that the monkey is later going to take.
So you can predict the monkey's decision, which of the two buttons it is going to press. So this cell combines both the short-term memory and the decision. But then you have cells like this one that do have different responses for the different frequencies but don't really participate in the decision, or a cell like this one that does not participate in the short-term memory but then participates in the decision nonetheless. And so you find all kinds of variety, and you can keep playing that game. In fact, the very first thing I did when I got this data set was send all those cells to the printer and then come back with, say, 1,000 pages of individual neurons and go through it. It's kind of intriguing, it's kind of nice to look at, but it's not extremely enlightening in the sense that afterwards you could say, I now understand what the system is doing.

And so our first exercise will be to extract the data from the raw MATLAB files that I originally got, which were packaged more or less the way they handled the data at the time, and produce these PSTHs. That'll be the exercise for more or less the next hour. So I'll explain a little bit. You should all have access to this Git repository, whatever you want to call it, to which I uploaded a folder called Machens tutorial earlier. In that folder there is a file called data.zip or data.tgz, depending on your operating system, and you should try to decompress that, because that is the data. Okay?

But before you do that, let me just very briefly explain again how the data is organized, so that you have it in your mind; then we'll look at how it's organized in the MATLAB file. This is just to get a more principled idea of how the data is organized. It's a recording of spike trains, and they're already sorted into trials. For each of the trials, you get the following information: for trial one, what the first stimulus frequency is, what the second stimulus frequency is, and what the decision is. It turns out the decision you'll have to compute, but you can compute it easily from the information available. And then the spike trains of seven neurons; that's because they were recording with seven electrodes at the same time. Nowadays that may not sound very dramatic, but 20 years ago that was really state of the art: seven electrodes at the same time, individually movable, and you get the sorted spike trains as spike times in milliseconds in seven of these channels. Okay? That's what you're going to get. And then there will be trial two, and trial two again gives you frequency one, frequency two, the decision, and the spike times of, again, seven neurons. You'll also get some information about what the time of the stimulus onset was for the first stimulus; that gives you an alignment along which you can then compare the spike times of these different neurons.

The next thing we're going to do after that is move towards actual dimensionality reduction. But before we do so, just a quick wrap-up of this afternoon session. In the end, if you run all of these scripts and you run the last one, the one that computes all the Romo PSTHs, it'll generate a large MATLAB file, which I think is also part of the folder, containing an array of size number of neurons, and I think there are 370 in the end, times number of conditions, and I think in this case there are actually 12 conditions,
that is, six f1 frequencies times two decisions, and then the number of time points. And that's the array we're going to work with later to run PCA.

So let me now continue: how do you analyze this data? You've looked at one cell, a few more cells, and you see that it's interesting to look at individual cells, but it's hard to get a feeling for what is happening in the population. One of the problems that we face is that these cells respond in various ways with respect to the task parameters. Imagine, for instance, that you sort them according to the stimulus or the decision, which is what we already did, but you could also sort them by whether the animal got a reward in the end or not (in this task, if the animal gets it right, it will always get a reward). And then you can imagine that you have these different colors to indicate whether a cell responds to one of these parameters. One way of looking at this set of neurons is to notice that some of them respond very strongly to the stimulus, so they get very yellow; some of them respond strongly to the reward, so they get very blue; and some respond to the decision, so they get very green. But different cells mix these different bits of information in various ways. People have called that mixed selectivity, and that has been a big problem in the field. If you talked to people in the field 15 years ago, they would be very frustrated with understanding the prefrontal cortex, because all cells do all kinds of stuff, so it was hard to sort through them. An individual neuron is selective with respect to different task parameters in different combinations, and that's called mixed selectivity.

So classically, here's how people would try to make sense of the data. They would say, well, let's group the neurons. You apply your favorite statistical test to see whether they are selective for blue, green, or yellow, or in this case stimulus, decision, reward. And then you say: all the neurons that respond to the reward I'm going to put into one group, all those that respond to the decision I'll put in another group, and the yellow ones in another group; the groups can be overlapping. And then what you do is just take the averages of those groups, what is called the population average. So if you read the literature on higher-order brain areas, they often have these population averages, where they average over the neurons determined by some criterion.

And here's a way of visualizing what you did, in some sense. You can imagine, in this task that we're investigating, that here's time: at time zero, the first stimulus comes on and lasts for half a second, then there's a delay, and at 3.5 seconds the second stimulus comes on and you can already read out the decision of the monkey. One thing you can then ask is: what fraction of cells is selective, say, for the stimulus, or for the decision, et cetera, at each point in time? That would be a classical analysis. So you would see that maybe initially they're not selective for either, and they'd better not be, because there's no information about either yet. Then 25% of the cells become selective for the stimulus, then the number of selective cells decreases over time, then it increases a bit, and when the second stimulus comes on, up to 40% of the cells become selective. Okay, so you get some idea of what is happening in the data.
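Here is a small sketch of how such a selectivity count might be computed, assuming you already have trial-by-trial binned firing rates and one task label per trial; the function and variable names are hypothetical, and this is just one reasonable way to set up the test, not the exact analysis from the original study.

```python
import numpy as np
from scipy.stats import f_oneway

def fraction_selective(rates, labels, alpha=0.05):
    """Fraction of neurons 'selective' for a task variable at each time bin.

    rates  : array (n_neurons, n_trials, n_timebins) of binned firing rates
    labels : array (n_trials,) with the task variable (e.g. f1) on each trial
    Runs a one-way ANOVA across label groups, separately per neuron and time bin.
    """
    groups = [labels == v for v in np.unique(labels)]
    n_neurons, _, n_bins = rates.shape
    selective = np.zeros((n_neurons, n_bins), dtype=bool)
    for i in range(n_neurons):
        for t in range(n_bins):
            samples = [rates[i, g, t] for g in groups]
            _, p = f_oneway(*samples)
            selective[i, t] = p < alpha
    return selective.mean(axis=0)   # fraction of selective neurons per time bin
```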
Then another thing you could do is look at one point in time, for instance this point here, and look at which cells are selective. How did we determine these cells? Well, in this case we used an ANOVA as the test, with a significance level of 0.05, and then you say: these guys here that are dark gray, those are the ones that were significant, and the ones that are light gray are not significant. And then you can average this half and that half if you want, and that would be a population average. And if you do an average PSTH, so here we basically average these guys together with these (we just flip the sign in this case), then this would be the average for the stimulus, and this is the average population activity for the decision, where we do the same thing, and it gives you some coarse view of what is happening in the data. But then you may wonder: did I have the right representation of the data? Is this really what is going on? So that is a sort of ad hoc method, something where you may not be so sure afterwards that you really captured what was going on in the data.

An alternative to this approach is to use unsupervised methods, such as principal component analysis, that try to just give you a summary of the data without, you know, any bias in the way you look at it. That is what we are going to do first, and then tomorrow we'll look at some steps beyond that method, which we developed ourselves, in terms of how to look at this data. Well, of course, some bias is always there, in the experimenter or in the person that does the data analysis; that bias is hard to get rid of. But the methods have different trade-offs, so let's put it this way: that's the way I would phrase it. It's not like one method is wrong. It's a particular method, and you get out a particular answer. The thing that you have to learn is what you get out, given the assumptions that you put in. That's the crucial thing; there's no right or wrong method. The key thing to learn is: what are the assumptions that go in, what do you get out, and what did you learn from the whole thing. So it's not that PCA is better or worse than this; it's just a different look at the data. And tomorrow we'll do demixed PCA, and that is not better or worse either; it's just another way of looking at the data.

So we're now going to learn something about principal component analysis. Who has seen principal component analysis before? All right. Who feels familiar enough with it that they could implement it from scratch in MATLAB or Python? Okay, so there are a few. But that's what we're going to do: we're going to implement it from scratch, and to do that we'll need to understand it in depth first. We're going to ignore the task context again; we'll just look at the spike trains without thinking about the frequencies, et cetera. So here's a recording, it's actually from auditory cortex, but it doesn't matter. You have 130 neurons, and these are recorded simultaneously, but again that doesn't really matter for what we're going to do. And here's one way in which you could think about this activity, which we already did in some sense: you take the spike trains within a particular window here and you turn them into spike counts or firing rates. So far we did it by smoothing, but you can also imagine that you just do it with windows; many people do that.
Then you basically get this column in the matrix that holds the individual firing rates during this little window, for the n neurons; so there are 130 neurons at time step one, and you can imagine this 130-dimensional vector, because that's what it is, as a point in a 130-dimensional space. I only plot the first two dimensions of this space, because I'm more of a two-dimensional guy. And then you get a particular point that tells you, okay, this is where you are in this space; that would be the firing rate of the population at that particular time point. Then you can do the same thing for the next time step. That gives you another vector, and that other vector is another point in that 130-dimensional space. And as you keep doing that, you get a whole bunch of points. Of course they're connected through time, but for the purposes of PCA we're actually going to ignore that fact; for us they're just points.

And one thing that you can imagine, if you have lots of these points over a recording, the question you could ask is: do they really span all 130 dimensions of that 130-dimensional space? Or maybe not. So in this little example you see there are two neurons, but they don't really cover the whole two-dimensional space; they kind of lie along this line. And similarly, in the 130-dimensional space you could imagine that the points may be lying in a 6- or 7-dimensional subspace. And then you may say, well, maybe it's useful to characterize that 6- or 7-dimensional subspace as opposed to thinking about this as a 130-dimensional problem. And that is the thing that PCA does for you.

So let's imagine that is our data. Sometimes it's also called the state space of firing rates, or just the state space. What you first want to do is center the data. Centering the data is a common pre-processing step that just makes the math easier afterwards; it's not particularly deep. It just means you shift your whole cloud to the center of the coordinate system. In practice, what you do is compute the mean of this cloud, which is kind of like the center of mass in physics. So you take the mean in this direction, or, if you want, the average of each of those 130 rows that you collected in that matrix, and you subtract it: from each data point we subtract the mean. Is it the mean over the rows? Yes, each row is one neuron, and so the mean is just the average firing rate of that neuron over time; it's just one number per neuron, basically. That's what you subtract. And this is what it looks like; please note the subtle difference. The subtle difference is that the coordinate system was previously centered at 2 hertz, and it's now centered at 0. Officially, firing rates are non-negative numbers, but now, with respect to the first or second neuron, you have negative values; that's really just because you subtracted the mean. So that's called centering. Very common, and not just in PCA; also for regression, et cetera. Many things get easier if you center the data first.

Now, the next step that we're going to do in PCA is project the data. So what is a projection? Graphically, it's shown here. The first thing you need, if you want to project the data, in this case onto just a line, is a vector that characterizes that line. That vector is called u here, and it's a unit vector, so it has length 1. Mathematically, if you want to project onto that line, you take a blue point and you want to get the corresponding red point back.
And this is the equation that does it for you. So xi is the original vector. You take u transpose times xi; that's the inner product or dot product. Who has not heard of the inner product? Or, let me ask it positively: who knows what the inner product or dot product is? So most of you know it. So u transpose xi, that is an inner product or dot product, and it gives you a single number. That single number tells you how far from this zero point you have to go out along u; so it tells you the distance from here to this red point. And then what you do is multiply this number with the vector u, and that gives you back this point, now as a two-dimensional point. That's basically what a projection is. And that's actually one of the key things to learn in PCA: what a projection is. I think there are really only two things to learn in PCA: one is what eigenvectors are, the other is what projections are. So projection is a really key concept, and I'll illustrate graphically what PCA does. But basically, each of these points gets projected onto this line; the line is defined by the vector u, and this is the equation that does the projection for you. The red vectors pi, those are the red dots here.

Now, the key question that you then ask in PCA is: what line should I project onto? And you're trying to find the line that maximizes the variance of the data. What I'm showing you here are just different possible projections. You see that the vector u is rotating around, and so is this line, and you see that depending on where the line is, the projection is different. So what I want you to think about now is: at what point is the variance of the red dots along that line maximized? When do you have the maximum spread along the line? Maybe you can just call it out. Now? Exactly. So that's when you have the maximum spread, and that's the axis you want to find, the axis of maximum spread. That's one way of interpreting what you do in PCA: maximum variance.

But there's another way of interpreting what you do in PCA, and that is to think about at what point these red dots are as close as possible to the blue dots. Graphically, that means you want to minimize the distance between the blue dots and the red dots, and that will happen when all these red lines here are as small, or as short, as possible. That's the explanation of PCA as the best compression of the data, given that you go from two dimensions to one: you want to stay as close as possible to the original points. And now think about when that's going to happen. Now? Exactly. And the point is, it happens at exactly the same moment. So the two are equivalent: whether you try to maximize the variance along this direction or whether you try to minimize the red lines, that's mathematically exactly the same thing.

Is this the same as linear regression? No, it's not, because linear regression has a different metric. (I cannot stop this animation, actually.) Linear regression, if you go from here to here, only measures and minimizes the error in this direction, because you're only interested in y. In linear regression the output is the second axis, and that's your error; you only care about the error on the second axis. You don't care about the first axis; it's fixed. That's why you only care about the error in this direction.

Now, the projection error, the error of these red lines, you can formulate mathematically as shown here. It's basically a sum over all data points, so the sum goes from 1 to T, the number of time points in this case.
xi is the vector, one of these blue points here, one of your data points. u u-transpose xi, that's your projection, and xi minus the projection, inside these funny double lines, that's called a norm. The norm is basically the length; it's just a shorthand for the Euclidean metric, Pythagoras if you want, for the distance between xi and the point pi. And then you square that length, and the reason you square it is just because that's mathematically convenient. So that's the distance, and then you sum the distance over all points; that gives you the overall error, all these red lines squared and summed together, and then you say: that's what I want to minimize. So that's one way you can formulate PCA. You say, I want to minimize this. With respect to what, actually? You minimize, but what is it that you're changing to minimize it? u, exactly; you change the direction u, under the constraint that u has unit length; that's also important. Is that the reconstruction error? Exactly, yes, that's the reconstruction error; the projection error is the reconstruction error.

Okay, and then, as a little math exercise for those of you who are curious: if you multiply this out, you can reformulate it as this equation, which is minus u-transpose times the sum over xi xi-transpose (these are the data points, and this sum is essentially the covariance of the data points) times u. This is the other formulation of PCA; it's just multiplying things out. So this is u-transpose times the covariance times u, where the covariance is a matrix, and it tells you that you're supposed to minimize the negative variance, which is the same as maximizing the variance. So the minimum projection error is the maximum variance.

Now, the way we're going to do it numerically is to go with the maximum-variance formulation, because that one is easier to handle. So we say L is basically u-transpose times this sum of all the data points times the data points transposed, times u. Now, this term xi xi-transpose is an outer product. Who knows what an outer product is? An inner product is basically a row vector times a column vector, and an outer product is a column vector times a row vector. The complication of the outer product is that it gives you a matrix, and this is a sum of outer products, so it's a sum of matrices. If you sum them all up and divide by the number of data points, then you call that the covariance matrix; if the data is centered, it's the covariance. And you're supposed to maximize this under the constraint that u has length 1, and you can do that just by taking the derivative under the constraint, using a Lagrange multiplier. I don't think I'm going to derive this for you; I'm just going to give you the solution. You take the derivative with respect to the vector u. That may sound terrible, taking a derivative with respect to a vector, but it actually just means you take the derivative with respect to all the individual elements of the vector, so you take n derivatives, or you can just look up the vector derivative in the Matrix Cookbook, a nice resource for these kinds of things. If you take the vector derivative, then you will figure out that this is the solution: C times u equals lambda times u. And this equation is known as the eigenvector equation in linear algebra. Okay, who is familiar with eigenvectors and eigenvalues? Okay, so many of you. The key thing to notice is that u is now called an
eigenvector, C is the covariance matrix, and lambda is the eigenvalue. For our purposes we really just need to understand that, but maybe I'll briefly explain the idea of an eigenvector. A matrix times a vector you can understand like this: you start with a vector, you push it through the matrix, and you get out another vector. If that other vector didn't change direction, if it still points in the same direction, then it's called an eigenvector of the matrix, and lambda is just a scalar that tells you how it scales. And given that the covariance matrix is a positive semi-definite matrix, we know from mathematics that lambda will always be non-negative. Lambda also has a special meaning, because it's actually the variance of the data in the direction of u. So that's what the eigenvalue means: the variance of the data in the direction of u. And u will be the principal axis onto which you want to project the data.

So far we've just tried to find one principal axis, the best one. The interesting thing about this equation is that it doesn't just give you the first principal axis; it will also tell you the next best principal axis, and so the best two-dimensional subspace, the best three-dimensional subspace, the best four-dimensional subspace, et cetera. And the nice thing is, if you use MATLAB, or also Python, this is just a one-liner: you hand over this covariance matrix C, and MATLAB, for instance, will give you back all these vectors u and all the lambdas. So that's quick. And so what we'll have to do is compute the covariance matrix from the data, push the covariance matrix through MATLAB's eigenvector routine, which gives us back these eigenvectors, and then we project the data onto these eigenvectors and plot it, and then we've done PCA.

Now, one way of thinking about PCA is that it's a bit like a bottleneck. You have these neural activities here, it could be 130, or 370 as we will have, and you push them through this principal component, which is the projection onto one axis, and then you map it back onto the original neurons, and then you can compare these projected activities with the original activities. And that difference is what you're trying to minimize. Of course, you can have a bottleneck that's not just one axis; there could be several axes, so it could, for instance, be a three-dimensional space. You can think of it like this.

And what we're going to do is apply it now to this working memory task that we already worked with earlier. This is the type of population data we have: here are the time points, these are the different firing rates, or rather, that's the way we have to organize the data. And one thing that you may wonder about is: what are we supposed to do with all those conditions, you know, the stimulus frequencies, the decisions, et cetera? Here's the way we're going to handle it. We really have to think of these data points as points in that space that just carry different labels. So you can imagine that we sort the matrix in the following way: here are the neurons, there will be something like 370 neurons, and then we take condition number one, so condition one would be stimulus frequency F1 is 10 hertz and the decision is yes, and we paste in the time series, the PSTHs we computed for that condition, for all neurons, basically up to here. And then afterwards we take condition number two, so stimulus frequency F1 is 10 hertz and the decision is no, and we paste that in, and then condition 3, et cetera. That's how we construct the matrix,
because the key thing is to think of these vectors as firing-rate vectors, and each firing-rate vector just has a label. The label tells you when it occurred in the trial, what the frequency was, and what the decision was. In terms of PCA, the order here doesn't matter one bit, because these are just labels on the dots that you had in this two-dimensional space. So we have this 2D space here, and PCA doesn't care whether this was the first point or that was the first point; that doesn't matter at all. Which way these points are sorted into that matrix doesn't matter; the only thing that matters is that these are points, that, you know, the first coordinate was minus 2 and the second coordinate was minus 1.5 in this case. That's the thing that matters. So that's the way you have to think about the conditions in the task from Romo: the conditions are just labels on your points. You have to keep track of what the conditions are in your matrix, but for PCA it doesn't matter. Okay, so this is your matrix; here you have your conditions; for each point you have time, frequency, and decision. Whichever way you scramble the matrix doesn't matter; it's going to give you exactly the same output in terms of PCA, because the covariance matrix is going to be exactly the same.

And so the exercise is now the following. First you load the MATLAB file that has all the PSTHs for Romo's data. Some of you have managed to compute a single PSTH; now imagine that you had all of the PSTHs. They're sorted in the following way: there's an array with three dimensions, number of neurons times number of conditions times number of time points. You are now supposed to reshape that array; MATLAB has a function reshape, and I assume Python has a similar function, also called reshape, exactly. You reshape it into a matrix of neurons times (conditions times time points). Then, once you have this matrix, the first thing you should probably do is just plot it, just to make sure you understand what you did; you can plot it using, for instance, imagesc, as a false-color plot, which is one quick way of plotting a matrix. Then you're supposed to center the data; I'm not going to give you any more details on that, but you're supposed to take this matrix X and center it. Then you're supposed to compute and plot the covariance matrix, then you determine the eigenvalues and eigenvectors of this matrix, then you plot the eigenvalues, and then you plot the first principal component. Okay, and you have until six o'clock.
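In case it helps to see the shape of those steps, here is a minimal sketch in Python on made-up data; it is not the solution on the real PSTH array, the array sizes are placeholders, and the variable names are mine. In MATLAB, the analogous call would be eig applied to the covariance matrix.

```python
import numpy as np

# Made-up data matrix: n neurons by p points, where the points stand for all
# conditions and time points concatenated, as described above.
n_neurons, n_points = 370, 12 * 100
X = np.random.randn(n_neurons, n_points)          # placeholder for the real data

# Center: subtract each neuron's mean firing rate (one number per row).
X = X - X.mean(axis=1, keepdims=True)

# Covariance matrix of the centered data (n_neurons x n_neurons).
C = X @ X.T / X.shape[1]

# Eigenvalues and eigenvectors; eigh handles symmetric matrices and returns
# eigenvalues in ascending order, so re-sort to descending variance.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u = eigvecs[:, 0]                          # first principal axis (unit length)
pc1 = u @ X                                # first principal component: u^T x_i for every point
var_explained = eigvals / eigvals.sum()    # fraction of variance along each axis
```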