Today I will be talking about unsupervised learning using dimensionality reduction, and we're going to approach it through two different matrix decomposition techniques, namely principal component analysis and non-negative matrix factorization. In the second part we're going to address it in the context of a chemistry example. For those of you who are veterans of this series, we're approaching this from the machine learning perspective of learning from data. As we've seen, we've been able to use cyberinfrastructure and data repositories to develop predictive models and to carry out tasks like classification. In this case we're going to address dimensionality reduction: our data lies in an n-dimensional space that may be too difficult to post-process or interpret, and we're going to try to reduce it down to a usable number of features that we can extract. We're going to do it in the context of the decomposition of reactive materials, and again, all of this is to help guide the design of the experiments you might do next. As an outline, we're going to go over dimensionality reduction, looking at both principal component analysis and non-negative matrix factorization, and address it in the context of a chemistry model. I'd like to first divide machine learning into two branches, namely supervised and unsupervised learning. In supervised learning, we take inputs, pass them through a model, and the model helps us predict outputs. If you've participated in our previous sessions, you'll know that we've developed linear regression models that take in features like melting temperature to predict Young's modulus. The second part is classification, where we can do things like reading in different features of specific elements, say ionic conductivity or coefficient of thermal expansion, and then classify those elements into different crystal structures. In this session we'll be looking at unsupervised learning, where we take in all the data we've been able to capture and want to extract the features that stand out, and there are two sub-branches under this category. The first is clustering; the example shown uses the k-means clustering algorithm, which can break up images, in the top row grouping things like the eyes, the hair, or the hat, and in the bottom row things like the road, the sky, or the clouds. What we're going to focus on is dimensionality reduction, where we start in a higher-dimensional space; say this original image of the number 8, which can be described by 784 components. We want to remove the extraneous information, since some of the components don't really add anything, and keep the ones that matter, say edges or vertices. By reducing the number of components, can we still retain the original image, or explain most of what's happening in it? This is very powerful, because we can then post-process more data or take this information even further. That sets up the motivation for why we care about dimensionality reduction. I'd like to start with a very simple example of dimensionality reduction and tie it into the algorithmic approach of principal component analysis.
Some of you may be familiar with this, but this is a simple molecule made up of three atoms, moving in time. Each atom has its own motion in three dimensions, so we have 3N coordinates to keep track of, evolving over time. We want to see if we can describe the time evolution in the left image as a linear combination of concerted, uncorrelated motions. What we have is a matrix of the positions, 3N entries, which comes out to nine here since N, the number of atoms in our system, is three, and they evolve over time. Here is how we can decompose that into a linear combination of uncorrelated, concerted motions. We do so by creating the covariance matrix. Essentially, the covariance matrix tells us how one feature is correlated with another feature. What we want to do is decorrelate them; we want a linear combination of uncorrelated features. We do that by diagonalizing said covariance matrix, and by diagonalizing it we are removing the off-diagonal terms of our n-by-n covariance matrix, which decorrelates the problem. You might have realized that this is simply a normal modes analysis, if you remember your chemistry or physics. By diagonalizing the covariance matrix we obtain two outputs, namely the eigenvectors and the eigenvalues. The eigenvectors are the normal modes; these are the concerted motions of the individual features. The eigenvalues relate to the frequencies. So that's a simple example of how we can do dimensionality reduction, and it follows the same approach as principal component analysis, which I'll put up here. We'll look at PCA in terms of feature extraction for facial expressions. The algorithmic approach is shown here: we compute the mean of each feature, build the covariance matrix using this equation, and then compute the eigenvectors and corresponding eigenvalues by diagonalizing that covariance matrix. In terms of feature extraction, you can see that the eigenvectors actually give us useful information about what we've done. Remember, what we want is to describe our overall data as a linear combination of uncorrelated eigenvectors. You can see that the first eigenvector describes the most about the face; in this first reference it mostly captures the shading going from dark to light, and in the second reference you can see that we're able to extract certain features; it puts together the eyes, nose, mouth, and so forth. What I want you to take away from this slide is that eigenvectors encode features of our original data, and that when you take a linear combination of them, you should recover your original data. So let's actually start going into the code. I'd like you all to go to this link here: open up your browser and go to nanohub.org/tools, and there should be a dimensionality reduction / matrix decomposition tool. If you do so, you should come to this screen here. I'll wait a couple of seconds for everyone to catch up. If you're able to follow along in the web browser, you can click launch tool, and it should take you to the landing page, which I have open right here. What we're going to do first is the principal component analysis example; this will be our first tutorial notebook. If you click on it, it should open up this notebook here.
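Before we get into the notebook, here is a minimal NumPy sketch of the three algorithmic steps just described: compute the means, build the covariance matrix, and diagonalize it. The data and variable names here are illustrative and are not the notebook's code.

```python
import numpy as np

# Illustrative data matrix: one row per observation (e.g. a time frame or a face image),
# one column per feature (e.g. an atomic coordinate or a pixel intensity).
X = np.random.rand(100, 5)

# Step 1: compute the mean of each feature and center the data.
X_centered = X - X.mean(axis=0)

# Step 2: build the (features x features) covariance matrix.
cov = np.cov(X_centered, rowvar=False)

# Step 3: diagonalize the covariance matrix. The eigenvectors are the uncorrelated
# "modes" (the principal components); the eigenvalues tell us how much variance
# each one carries (in the molecule example they relate to the frequencies).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from largest to smallest eigenvalue so component 1 explains the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
```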
So we're going to explore principal component analysis by actually visualizing what it's doing. If you follow along with me, full-screen it. I want to start off by saying that if you have printed out the handout, these cells may look a little bit different from what you have there. This is just for the purposes of this demonstration, to give more background and provide more annotation of what we're actually doing with the code. The output should be exactly the same, but for this demonstration I wanted to annotate more so you have a better understanding of what we're actually trying to do. Again, with PCA, what we're trying to do is emphasize variation and extract correlations in our data set; PCA can actually help decorrelate our data. In this toy example we're going to have two input variables. Of course, this algorithm can be used in higher dimensions, but in this case we're just going to approach it with two inputs. In this example, we have a population where we know there's some correlation between the heights and weights of individuals. These are our two input variables. I want you to imagine sampling, or even surveying, 500 different people to obtain their heights and weights. Then we want to normalize this information, assuming a standard normal distribution. These lines of code are going to set up that normalized data for us, but I wanted to give you the framework for what we're actually trying to do: we're going to eventually plot normalized heights versus normalized weights that have already been standardized. First, we're going to import two packages. The first one is NumPy, a very popular numerical computing package in Python, and the second one is Matplotlib, which is the standard visualization library. In the first line of code, maybe some of you are not familiar with it, we're calling the NumPy random module and defining a random state. You can read about what it does here, and if you're interested, you can go to the documentation to read more about it. All we're trying to do is create a random number generator so that we can draw from a variety of different probability distributions; that'll be shown below. We're just defining a random state. From here, we're going to define two different arrays. Our first array is of shape two by two: two rows, two columns. Using the random state we defined with seed one, we're going to randomly sample from a uniform distribution over the range zero inclusive to one exclusive. That's our first two-by-two array. The second array, which we're going to assign to the variable B, is a two-by-500 array, and instead of pulling from a uniform distribution we're now going to pull from a standard normal distribution; that's the n in randn. Then what we're going to do with these two arrays is take their dot product. If you remember your matrix algebra, when we take the dot product of A and B, the inner dimensions cancel and we end up with a two-by-500 array. Then we take the transpose of that; that's the capital T.
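Based on the description above, the data-generation cell likely looks something like the following sketch; the exact code in the notebook may differ slightly.

```python
import numpy as np

# Random number generator with a fixed seed (1) so the results are reproducible.
rng = np.random.RandomState(1)

# A 2x2 array sampled from a uniform distribution over [0, 1).
A = rng.rand(2, 2)

# A 2x500 array sampled from a standard normal distribution (the "n" in randn).
B = rng.randn(2, 500)

# The dot product of (2x2) and (2x500) gives a 2x500 array; the transpose (.T)
# turns it into 500 rows of (normalized height, normalized weight) pairs.
X = np.dot(A, B).T

print(X[:10])   # show the first 10 of the 500 one-by-two entries
```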
At the end of the day, we're left with 500 one-by-two arrays. So if we run the code cell (again, to run this you can hold shift and hit enter, or you can hit the run button), you should see that I've shown the first 10 entries of our data, which is 10 one-by-two arrays. This is our normalized heights and weights from the random sample of 500 people we defined: normalized heights, normalized weights. Then we're going to call matplotlib and produce a scatter plot of the normalized heights column versus the normalized weights column, set up a color scheme for the grid, and make sure the axes are equal; that's all just cosmetic, for visualization. What you can see here is that our normalized heights and normalized weights have some sort of linear correlation; you could say there is a relationship between the two inputs. Now, this problem was set up in such a way that we don't see any points in the top-left or bottom-right corner. That's because PCA pulls from linear algebra, so it's really not able to handle nonlinearly distributed data, and it's also not good with data that isn't multivariate Gaussian. The question was framed so it would succeed: we want to show that the PCA algorithm can learn this linear relationship between normalized heights and normalized weights. So let's move on to the second cell, where we'll try to understand what the PCA algorithm is doing by interpreting what it outputs. First, we're going to import the PCA class from the sklearn.decomposition module. If you'd like, you can look up more about that module; it has a lot more in it than what we'll cover here, which is PCA and NMF, but we're going to use it for the purposes of this workshop. We define our PCA object, and it takes a variety of arguments. What we're going to specify is that we want to explain our data using only two components. Very easy: the data is two-dimensional, so we'll use two components, and we assign the object to a variable. Then we fit the object to the data: fit the PCA model to the dataset X defined above, with the number of components set to two, and print out the object. If you run that, you can see that the output is the printed PCA object with a bunch of arguments we haven't touched; the only one we're interested in is the one we modified, the two components. Okay, great. What are we actually doing? What is the fit actually doing? What's interesting is that the fit learns some quantities from the data; these are shown below as the components and the explained variance. I'll go over them quickly, but I think it'll be easier to visualize. As I stated, we're fitting our data with two components, and each component is described by a one-by-two array: this is the one-by-two array for component one, and this is the one-by-two array for component two. I'll go into what these values actually mean in the next cell, but I'd like to move on to something more interpretable, which is the explained variance.
Again, this is the explained variance for component one, and this is the explained variance for component two. What this is saying is that if we do a two-component fit, component number one, however the algorithm computes it, is able to explain a little bit over 75% of all the variance in our data, and component number two can explain just shy of 2% of the variance. Just for completeness, I've done a simple sum of them: if we do a two-component fit to our data, the two components in total can explain 77% of the variance as defined in our data. Now we'll actually visualize what this is doing, and hopefully you'll get a better understanding of the components and the explained variance through a more visual approach. What we're going to do is use the components to define the direction of a vector (a vector in two dimensions) and use the explained variance to set the magnitude of that vector. So the components define the vector's direction, and the explained variance sets its length. That's what this code cell is doing: we're going to go through the code and create an arrow based on what we read in. This defines a function to draw the vector. All we're doing here is defining a dictionary with the style of the arrow we want to draw: the arrow style, the width of the lines, the color of the lines. Then we use this command to put the arrow in the figure with the defined style; arrowprops is the dictionary we just defined. Again, we plot the scatter of our original data, normalized heights versus normalized weights, and this just sets the transparency so it'll be a very light color. Then we loop over our explained variance array and our components array, define a vector v, and draw that vector on top. So we're using the components and explained variances of the fit defined above and overlaying them on our data to actually see what it's saying. If you run the cell (again, hit run here or shift and enter), you'll see that we can now visualize our data. The light blue shaded circles are our original data, and we've projected our PCA fit onto that original space of normalized heights and normalized weights. What you'll see are two vectors: this is component one and this is component two. Why is that? Well, remember component one explained the most variance, over 75% of our data, therefore it should have the longest arm. And if you look at the actual entries of component one, you'll see they relate to the slope of the data: both entries are positive. For the second component the explained variance is a lot smaller, and it has one negative and one positive entry. So what you can see is that when we do principal component analysis, it creates the principal axes along which we project our data, and it ranks them so that the component explaining the most variance comes first: that's principal component one, and this is principal component two.
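Putting the last two cells together, a sketch of the two-component fit and the arrow overlay might look like this; it reuses the X array from the sketch above, and the styling values are illustrative rather than the notebook's exact settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Two-component fit to the 500x2 data matrix X defined earlier.
pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)                 # direction of each principal axis (1x2 each)
print(pca.explained_variance_ratio_)   # fraction of the variance each one explains

def draw_vector(v0, v1, ax=None):
    """Draw an arrow from point v0 to point v1 using a simple style dictionary."""
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->', linewidth=2, color='k', shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# Original data, lightly shaded, with the principal axes drawn on top.
# The direction comes from components_, the length from explained_variance_.
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal')
plt.show()
```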
And it shows how strongly our data, in terms of variance, projects onto these principal axes one and two. Okay, so now that you have a feel for how the PCA algorithm works (again, the overall variance of our data is described by two components because we did a two-component fit), let's go on to actually reducing dimensions. We can only really go from two dimensions down to one here, but I think it'll give you an idea of what's happening when we do dimensionality reduction using PCA. Again, we're going to build the model. In this case we define our object with only one component; going from two down to one is all we can do. We assign the object to this variable and call fit_transform on it. Now, this might seem a little different from what we did previously. Previously we did something akin to just the first step: we took our object and only applied fit. So if I run this code cell (remember, shift-enter), we just define the object, in this case with the argument set to one component; that's all the fit does. What we want now is to actually manipulate our data: we want to see the output after we've reduced it from two dimensions to one. We do that by calling the transform step of this object, and if we do, you can see that we've gone from 500 one-by-two arrays to 500 one-by-one arrays. There's only one value per sample, because that is the single component we defined. All fit_transform does is the fit and the transform in one step. So now we've actually done the reduction. Let's see what it's done in terms of transforming the data. We started out with 500-by-two data, normalized heights and normalized weights; now we've transformed it down to one component, 500 by one, whatever that one component is as determined by the PCA algorithm. You can see that this component is described by a single one-by-two array: this is the principal component that explains the most variance in our data, and it has an explained variance of over 75%. So even dropping from two dimensions to one, we can still describe over 75% of all the variance in our data. Pretty impressive. Okay, now that we have our transformed data, it lives in principal component space, which on its own doesn't mean much; we've just projected our data into that space. How are we going to visualize it and compare it against our original data? This requires us to transform it back into the original space, which means calling the function inverse_transform. Remember, X_pca, our variable, is the result of fitting and transforming our data into principal component space. What we need to do now is apply the inverse transform to put it back into our original space. That's all we're doing: we're taking our fit, which lives in the lower-dimensional 1D space, and mapping it back into our 2D space of normalized heights and normalized weights, so we can plot it against the original data.
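For reference, the reduction to one component and the mapping back into the original space described here might look like the following sketch; it again reuses the X array, and the PCA and pyplot imports, from the sketches above.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce from two dimensions to one: fit and transform in a single step.
pca1 = PCA(n_components=1)
X_pca = pca1.fit_transform(X)
print("original shape:   ", X.shape)      # (500, 2)
print("transformed shape:", X_pca.shape)  # (500, 1)

# Map the one-dimensional data back into the original (height, weight) space
# so it can be plotted on top of the original points.
X_new = pca1.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)          # original data, lightly shaded
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)  # reduced data on the principal axis
plt.axis('equal')
plt.show()
```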
You can uncomment this print statement if you would like to see what the outputs actually are, but we're just going to plot the scatter of normalized heights and normalized weights, again with the transparency set so it's light, and then plot our new data, because we've mapped it back to our original two-dimensional space. If you run this cell, what you'll see is that our original data is again in light blue circles, and we now have our new transformed data, reduced from two dimensions to one and all projected onto the single principal axis: the one principal component whose vector, if you remember, had its initial point here and pointed in this direction. So what we've done is throw away the variance along the second component and project everything onto the principal component that explains the most variance, which in this case is the only component we kept. This is what going from two dimensions to one dimension actually does in PCA. Hopefully this gives you an understanding of how the PCA algorithm works, and we'll see it again in the second example, which is our chemistry model. With that, I'll jump back to the PowerPoint presentation. As I stated, we've done PCA dimensionality reduction: we took our two-dimensional data of normalized heights and normalized weights, applied PCA, projected onto one dimension, and put it back into our original space. As you can see, we were able to learn the linear correlation successfully. Of course, this was all set up so the example would succeed, so let's put it to work on a real example, which is our chemistry model. Before that, I'd like to go over what NMF does, because it's our second method: now we're comparing PCA and non-negative matrix factorization. In the interest of time, if you'd like, you can go to the paper here; it defines both methods, but we're going to provide a quick review of what the paper does. In this paper, the authors have an original image of a face, and they want to see how PCA and NMF reconstruct that face when they do the combination prescribed by each dimensionality reduction algorithm. We'll first start with PCA. What you can see is that with PCA, the product of two matrices recovers the original image; the individual matrices are the principal components, or bases, and the weights matrix. They took the principal components shown in this object here, and when you take the dot product with the weights, you can recover the original image. What you'll notice is that their weights matrix (remember, this is a linear combination, and all PCA is doing is diagonalizing the covariance matrix) allows for both positive and negative values, red being negative. If you think about it for a minute, that doesn't really lend itself to interpretability. What does a negative feature actually mean? Some of these eigenvectors have red eyes or red cheeks for these faces; that doesn't mean anything physically. So that's not very interpretable. What they then did was move on to non-negative matrix factorization as an alternative way to do this decomposition, and what you see is something very interesting.
Again, we have our components, or bases, and we take the dot product with our weights matrix to recover something very close to the original face image. What you'll see is that we don't have any negatives: all the values in our components and weights are white or black, namely zero or greater than zero. And now this component is actually able to extract out a specific feature, the eyebrows, which is very interpretable. So this shows why NMF can be more interpretable than PCA. Let's move on to our chemistry example. In the chemistry example, what we have is complex chemistry occurring in our system: atoms are interacting with their nearest neighbors, bonds are forming, bonds are breaking, chemistry is happening, and our system is decomposing; it's a reactive material. What we want to do is deduce a reduced-order kinetics model from that simulation, or from the data you've captured. Going back to basic chemistry, we want a few simple curves that explain the decomposition process of our complex model: reactants as our first curve, evolving into intermediates, and finally products. Three simple curves that explain the overall time-dependent behavior of the chemistry happening in our system. With that information, we can then do fits to develop rate equations, and that's important because we can pass it on to, say, mesoscale or continuum-level models to take it a step further. So you can see the importance of dimensionality reduction in the context of this chemistry example. How do we do it? Well, a naive approach would be to look at all the atoms: we have information about their positions, their nearest neighbors, what they're bonded to, and so on, and we could track all of that evolving over time. You can see that this blows up quickly as the system grows, as the number of atoms grows, or as the number of time frames we have to track grows. That's a high-dimensional space we're not going to be able to track easily. So how can we use dimensionality reduction to our advantage? Well, we notice that we can decompose our overall chemistry into a finite number, 280, of different bonding environments that describe the overall decomposition process of our system. This finite number gives us a cap on the number of features we have to track over time: 280. And what's very interesting is that there are curves with shapes similar to what we'd like as an output: curves that go down, similar to v1; curves that come up and then go down, which look like v2; and curves that come up and stay up, which look like v3, our products. So that's what we're going to do in this example: using intuition about chemical decomposition products from very simple chemistry, we can map, or project, our overall complex chemistry into a usable format to which we can then apply the PCA or NMF algorithm. So let's get started. I'd like all of you to go back to the landing page; I hope you didn't close it, but if you did, go back to the link and open up the second notebook, PCA and NMF, which I have open here. So back at the landing page, go to the second notebook, PCA and NMF, and if you click on it, it should open up this notebook. Again, the overview just reviews what I stated in the previous slide about the overall chemistry.
The PCA and NMF background I'll address later on, but you're more than free to read it on your own time. Let's get started with the first code cell. What I want to show you here is how complex our data is. So if you run the cell (remember, hold shift and hit enter, or click the run button at the top), it should print out the format of our data table. As part of the code, we're going to use the pandas library and its read_csv function, reading in the data file I defined, skipping some rows as the header, and then showing five rows. This shows you the format in which our data is defined, and the information in the first rows. Again, all of this is to show that the data is very complex, and you're more than welcome to look at it on your own time. The next piece of code is going to take about 40 seconds to run, so let's just start it and then I'll go over the breakdown of the code. Shift-enter, it should run, and let's go back up to the top. Essentially what we're doing is manipulating the information I showed previously into a more readable format, where we can actually capture the number of bonding environments each atom experiences over time; remember, I stated there are 280 of them. You might ask yourself, how did you come up with the number 280? What does that actually mean? Well, we know that for C, H, N, and O atom types (this is an organic reactive material) there is only a finite number of combinations, and going back to basic chemistry we assume a maximum of four bonds. Shown here, using intuition, are various molecules that we know exist in our decomposition process, just as examples of the different bonding environments we can have. So what we're going to do is import numpy (pulling in all of its functions), along with scipy, math, sys, and re, that last one just for string comparisons. Now we define our arrays. Going over the basic code for those who are not familiar with Python: we first define our function here, where we have a variable ifile, and all we're doing is opening the input file I defined above and reading what's in the file. Then we loop over all the lines in the file, splitting, or delimiting, each line on white space (this is the white-space character). If the second token is exactly equal to the timestep keyword, then we assign the third token to a variable called tstart and make it an integer. You might ask, why is this the second token and this the third? In Python, we start counting at zero, simple as that, and that applies to loops, conditionals, arrays, and strings. There's also a second conditional that looks for a number, how many atoms we have in our system, and then we exit out of the loop and close the file. After that, we move on to this piece of code, which is essentially a single-line way of counting the number of lines in the file. It's not cheap by any means, but it compresses into a single line: we open the file and count up the number of lines by summing, and that gets stored here in lines.
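As a rough sketch of the parsing steps just described: the file name, the number of skipped header rows, and the exact header keyword below are placeholders, not the notebook's actual values.

```python
import pandas as pd

# Peek at the raw bond-table output: skip the header rows and show the first few rows.
df = pd.read_csv('bonds_data.txt', skiprows=7, sep=r'\s+', header=None)
print(df.head())

# Scan the file for the starting timestep. Remember Python counts from zero,
# so tokens[1] is the second field on the line and tokens[2] is the third.
tstart = None
with open('bonds_data.txt') as ifile:
    for line in ifile:
        tokens = line.split()                              # delimit on whitespace
        if len(tokens) > 2 and tokens[1] == 'Timestep':    # header keyword is illustrative
            tstart = int(tokens[2])
            break

# One-line way to count the number of lines in the file
# (convenient, though it re-reads the whole file, so it is not cheap).
nlines = sum(1 for _ in open('bonds_data.txt'))
```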
So if we go down to the very bottom, you'll see that we've been able to complete the bond table. One thing I'd like to point out: if you're doing what we do in this notebook, which is writing to a file and then reading it immediately, sometimes the script can't catch up with what you've written if you jump straight to the next code cell. So make sure you flush the data from the buffer; that's all this command is doing, and it's very useful. Now we'll move on to the second code cell. After we've done our post-processing in a meaningful way, let's actually look at the result and do some visualization. I've shown you this image before in the slides: we're looking at the 280 different bonding environments evolving over time, and we can see three main features that we could extract. That gives us an indication to say, let's use n_components equal to three as the number of components for our dimensionality reduction. So we'll do that. First we'll run PCA, as we've seen previously. Again, remember we have to create our model, so we import the PCA object from the sklearn.decomposition library and set the number of components to three. If you remember, previously we only defined the number of components; in this example I'm defining several other arguments for the object as well, just so you can reproduce the results. One of them is the random state: with a different random state you'll probably get a different solution every time you run it. So we define our model and then fit it to the data. Remember, this is our 280 different bonding environments evolving over time; those are the two dimensions of our array. Then we print out the explained variance and the cumulative variance, and write a bunch of other files. What we're doing here is creating our model and calling fit_transform, that two-in-one process, and we're also going to show the components. So, shift-enter. What we see is that with a three-component fit using PCA, we can explain 93.3% of all the variance of our 280 features evolving over time with just the first component by itself. The second component gives a little over 5%, and the third component only explains about 1% of the variance of our overall data. That gives us a cumulative variance of 99.6%, which is very impressive: we're able to retain most of the variance in our data with only a three-component fit. This is great. So now let's visualize what the components themselves are doing. If you run this code cell, you'll see our three components: component 1 explaining the most variance, component 2 the second most, and component 3 the least of the three. Now, you might think to yourself, hmm, this doesn't really look like the concentration profiles I'm used to. Concentration is always zero or positive; there's no such thing as a negative concentration explaining our data. So you might say to yourself, since concentration can't be negative, these immediate results are probably not interpretable.
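As a code-level recap of the PCA cell just described, a minimal sketch might look like this. The bond_counts array below is a random stand-in for the real bond table, and I am assuming one row per bonding environment and one column per time frame, which is what makes the components come out as time profiles; the notebook's actual variable names and arguments may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the post-processed bond table: 280 bonding environments (rows)
# tracked over some number of time frames (columns). The real notebook builds
# this array from the parsed bond-table file.
bond_counts = np.random.rand(280, 1000)

# Three-component PCA fit; fixing random_state makes the run reproducible.
pca = PCA(n_components=3, random_state=0)
scores = pca.fit_transform(bond_counts)

# Variance explained by each component, and the cumulative total.
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

# The components themselves: three curves over the time frames that can be
# plotted and compared with the reactant/intermediate/product picture.
print(pca.components_.shape)   # (3, 1000)
```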
You might come up with a fancy way of manipulating the data, but for all intents and purposes in this demonstration we're going to say that it's probably not interpretable; these outputs are not interpretable enough for what we want, which is concentration profiles that explain the basic chemical decomposition process. Now we're going to move on to non-negative matrix factorization and showcase what it can do. Again, we start off with NMF and develop the model, very similarly to PCA. Now we're importing the NMF object from the sklearn.decomposition module, defining our number of components to be three, and setting up our model with three components using the NMF object. We initialize it exactly the same way as PCA so that this is a one-to-one comparison with what we did previously. Again, we call fit_transform on our model with the number of components we defined. If you do shift-enter, you should see that we've completed the NMF fit. Let's actually visualize what the fit is doing; you might think we'll get something similar to PCA. Now that we've done the fit, you can see that it seems to be a bit more interpretable. The constraint in NMF is that the values of the array we decompose have to be greater than or equal to zero; that's the constraint you see here, and the result looks more interpretable: this looks very much like our reactants curve going down, this looks very much like our intermediates curve coming up and going down, and finally we have our products curve coming up and staying up. However, you'll notice that, unlike concentration, this isn't on a scale of zero to one or zero to 100%, so we're going to have to do some normalization on this data. That's all this code cell does: it handles the normalization, and then we plot it. Now you see we have something that's a lot more interpretable in terms of our final goal. Remember, our final goal is three simple curves that explain the overall decomposition process of our system: the red curve, component one, as the reactants; the blue curve, component two, as the intermediates; and the green curve, component three, as the products. We've actually been able to visualize this, and it follows a very similar scheme to what we wanted. With that, I'm going to jump back to the slides. We've now applied dimensionality reduction using NMF, non-negative matrix factorization, and with these concentration profiles we can go back to chemical kinetics and fit them to rate equations. You'll see that we've done that for ensembles at different initial temperatures, and we can get overall kinetics parameters. Again, our end goal in doing this dimensionality reduction is to get kinetics parameters that can be passed on to continuum or mesoscale models for their benefit. So we've been able to reduce our n-dimensional data to three simple curves and get values to pass further along. Now I'd like to go into the non-negative matrix factorization algorithm a little bit to give you a flavor of what's happening. When we do our factorization, what we have is our matrix V; this is our encoded matrix.
And we want to factorize it, as best we can, into something approximately equal to the dot product of two matrices: W, our features matrix, and H, our coefficients or weights matrix. In this decomposition we have to define a new dimension p, our number of components, which was two or three in the previous examples. This is usually much smaller than the dimensions of our m-by-n encoded matrix, which is what allows us to reduce the number of features and look only at the most important ones. So what makes it non-negative? The non-negative aspect of this matrix factorization technique comes from the constraint that both the features matrix and the coefficients matrix must be non-negative: the entries of W and H all have to be greater than or equal to zero, and of course that means the encoded matrix itself has to be non-negative. A common approach to applying this constraint is to minimize an error function, the difference between our encoded matrix V and the product of the features and coefficients matrices, W times H, measured by the Frobenius norm ||V − WH||. If you're interested in the pseudo-code for how W and H are actually updated over the iterations of this minimization scheme, I would urge you to look at the paper here; it goes into great detail about the pseudo-code and how W and H are obtained. I'd like to summarize PCA and NMF overall. With principal component analysis, we saw in the facial-features example that we can get both positive and negative weights associated with the face, and the constraint is that the eigenvectors are orthogonal to one another; remember, it's a linear combination of uncorrelated features. With that constraint, though, it performs a global transformation: some of the components mixed noses and eyes and mouths together. And each component is ranked by the amount of variance it explains with respect to the original data space, with component number one explaining the largest variance. When we go to non-negative matrix factorization, it's more specialized, because we only allow additive combinations: both the W matrix and the H matrix have to contain non-negative values. That hard constraint is what makes it non-negative, and because of it we get a parts-based representation of the facial features. Remember, back in the faces example, the coefficients or weights matrix was black or white, meaning the values were either zero or positive, so when you sum the parts back up to recover the original image, the representation is very parts-based: we were able to extract out specifically the eyebrows or the nose, unlike PCA, which mixes the features together in its linear combination. So in this case NMF actually provides more interpretable outputs in the examples we've covered. With that, I'd like to thank you for sticking around for this tutorial going through dimensionality reduction, principal component analysis, and NMF.
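For reference, here is how the NMF fit from the notebook and the V ≈ W·H factorization just summarized might be sketched together. The array shapes and names follow the same assumptions as the PCA sketch above and are illustrative; the scikit-learn solver handles the Frobenius-norm minimization internally, so the iterative updates from the paper are not written out here.

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative encoded matrix, a stand-in for the 280 bonding environments
# (rows) tracked over time frames (columns).
V = np.random.rand(280, 1000)

# Three-component NMF, initialized with a fixed random state as in the PCA comparison.
nmf = NMF(n_components=3, init='random', random_state=0, max_iter=1000)
W = nmf.fit_transform(V)    # features/weights matrix, shape (280, 3), all >= 0
H = nmf.components_         # coefficient matrix, shape (3, 1000), all >= 0

# The error that NMF minimizes: the Frobenius norm ||V - WH||.
print(nmf.reconstruction_err_)
print(np.linalg.norm(V - W @ H, 'fro'))   # same quantity computed by hand

# Simple normalization so each component curve can be read on a 0-to-1 scale
# (the notebook's exact normalization may differ).
H_norm = H / H.max(axis=1, keepdims=True)
```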
And I'd like to bring you back to the grand scheme of what we're trying to do. Remember, we're learning from data. We have cyberinfrastructure, repositories, and cloud computing to assist in this. We've created predictive models, done classification, and in this specific session we've looked in detail at dimensionality reduction, both PCA and NMF, and all of this helps guide the design of experiments. So with that, thank you for your time. At this moment, if you have any questions, you may unmute yourselves; feel free to ask with the time remaining in the session. Thank you.

Thanks, Michael. I'd like to ask everyone, if you could, to unmute your mics and give some appreciation to our speaker. And we can start with the Q&A. Okay, so I'll start with some of the questions from the chat. One of the questions that we saved is: how does one interpret the reduced features obtained?

Is the question how does one interpret the reduced features? Like the components? Are they asking in the context of PCA or NMF?

I believe that would be PCA.

So with the reduced features, that would be the eigenvector matrix, right? This is the features matrix, not the coefficients or weights matrix, just to clarify. For the features matrix, what this is essentially saying is that these are the eigenvectors in our decomposition. In terms of the faces, this is how the algorithm breaks the image down: the component that explains the most variance captures something like the eyes, ears, or nose in our overall image. I'll use that as an example, since it's probably the easiest to understand; let me go back to this example. What you can see is that our original image is a face, and the eigenvectors define how much of the information, breaking it down pixel by pixel, can be explained in this decomposition. You can see that the later eigenvectors really don't explain much; in this case they just capture dark-to-light shading and don't add anything, whereas the highest-ranked eigenvectors explain most of the variance. So my interpretation is that the features encode information obtained by breaking the data down into linear combinations of uncorrelated information. Maybe I'm just repeating what I said in the slides, but it essentially provides information about certain features taken together. All the algorithm is trying to do is solve the eigenvalue-eigenvector problem by diagonalizing the covariance matrix; it wants to decorrelate the individual features, say certain colored pixels in this case, pixels for the eyes and pixels for the nose, where you can see slight differences in color. The algorithm breaks that down as best it can and says, this is what we get for feature number one, and so on. So maybe that didn't fully answer your question, but it essentially tries to extract certain features while applying the constraint that each feature needs to be orthogonal to the others. Maybe someone else in the chat who has done PCA can also provide some input. Do you have any comments on that?

Okay, we'll see if there's any follow-up if you need clarification on that explanation, but let's continue. For PCA, do we assume that the inputs follow a Gaussian, or normal, distribution?

Yes, I believe so. As I stated previously, we have...
I forget where I put it, but it's in the slides. This method works best assuming a linear combination; it's pulling from linear algebra with this matrix decomposition technique, so it works best if the data itself is linearly correlated. In terms of Gaussian, as you said, that would also help. There are other techniques that I haven't covered here that build on the PCA method but can be used for data that may not be linearly correlated; the best example I can give is kernel PCA. If you go back to the scikit-learn decomposition website, you can look up things like kernel PCA. It's very useful if your data is in, say, a ring, or has a large spread that almost looks like a target; it can apply the PCA idea to multivariate data that may not have a linear correlation. So that maybe addresses the question; that is another constraint of PCA that I found online, and of course there are a lot of resources that explain the downsides of PCA. But I thought I'd frame it with examples that people will recognize, like facial recognition; PCA is very prevalent there, and you can do a quick Google search for different examples.

Okay, so another question: in the scheme of machine learning, going from data to predictions, where does dimensionality reduction fall?

Dimensionality reduction helps when you have so much data that you can't even make a prediction. Say, for example, your data lives in a 10-dimensional space; maybe that's still interpretable for you, so you don't have to do dimensionality reduction. But say your data is in a 1,000-dimensional or even a million-dimensional space, where you don't even know what information is in there. This is feature extraction, or abstraction: we're trying to pull out commonalities in our data, things that go together with one another. If you have too much information, or you want to find patterns in your data, it's very useful to do this to pull out patterns, so that you might then say, oh, look, there's actually a relationship between feature one and feature two, let's put that in a model. So I would say it's very useful when your data is in a higher-dimensional space that's too complex to even post-process. I would also add that you can use it at the back end as well; you don't have to do it at the very front. If you run a model and obtain some outputs that are still not interpretable, you can then apply dimensionality reduction to turn them into meaningful, interpretable results. So it can be used essentially anywhere, I think.

Okay. Another question: how can you find redundancy in your remaining components?

How can you find redundancy? That's a really good question, because it concerns the information that you throw out. It's good food for thought, but I would say I haven't really addressed redundancy. In this case, again, we're trying to break our data down into some number of eigenvectors, in this case for PCA, just to recover the original image, and we're not really concerned about the information that got discarded. You can see that even in this example here, we've gone out to the 500th eigenvector and it's not really adding any more information; the outline of the face doesn't really do much.
Maybe you could stop at, say, the 10th or the 50th feature here. In terms of redundant information, maybe there's useful information there that I haven't thought about, and maybe you could decompose it even further, but it's not something I have looked into in detail. Thank you for the question; it's very thought-provoking.

I think this is the last one from the chat: how do you choose how many components should be used, in both PCA and non-negative matrix factorization?

With any reduction algorithm, I would say there's really no fixed rule; it depends on your interpretation of what you want to do. In our case with NMF, in the context of the chemistry model, it's very easy because we have intuition: we're pulling from chemistry, and we know what we're looking for. That allowed us to say, let's go with three components, and that's what we fit with, and by all means we were able to obtain a three-component fit that looks very close to the chemistry. We could easily go to four components, and that's currently in the works; we can develop models for that. I would say that with PCA and NMF you would probably want to look at the cumulative explained variance. There's no really good metric for defining the cutoff; that's still a research question of interest. But I would say that if you can explain, say, 99% of all the variance with PCA, or you can project your data with NMF onto, say, five components and still explain each of those features overall, then you've retained most of the information in the original data by projecting down to that lower-dimensional space. Again, the question of what the correct number is remains to be determined, and it's still a metric people are looking into: when is enough enough? Sorry I couldn't answer your question, but if you come up with the answer, please do publish it for the community.

Okay, another question: for numerical inputs and outputs, can I identify significant input parameters from this decomposition?

Input parameters? I don't know for certain that you can; from these examples it would be difficult to extract input parameters directly. The methods I've shown are more about finding commonalities and correlations and breaking those correlations apart in the analysis. If I remember back to the PCA example, we went from normalized heights and weights and did a dimensionality reduction, but it's not like we could then say that normalized height is what we should focus on versus normalized weight; we were just able to project our data down to a simpler dimension, from 2D to 1D. But I think it can give you an idea. Maybe the way I should answer this is: if you do this dimensionality reduction, going from a higher-order space to a lower-dimensional space, it might be able to dissociate itself from certain dimensions, which are attributed to certain inputs, as you were suggesting. So maybe, going from, say, 10 features to 3, it could collapse 4 of those dimensions and project only onto the remaining 3 as a linear combination of them, and you could essentially eliminate the inputs that explain the least amount of variance. So in a way I think it's possible, but I haven't shown it here, so I don't want to claim anything conclusive.

Okay: can I identify local variances using PCA and NMF, rather than global variances?
So, PCA is a linear technique, and I haven't looked much beyond vanilla PCA. I'm sure there are other techniques; I think there's one called independent component analysis, or something along the lines of an independent decomposition, that can probably do more local comparisons. I know that NMF is more local than PCA, but again, the constraint is that the data you read in has to be non-negative. I would urge you to look beyond PCA and NMF; this was just to give you an understanding of matrix decomposition. I didn't really answer the question, but I'm pretty sure there are methods out there, either in development or along the lines of independent decompositions.

Okay, I think you can see the last two questions in the chat; I saw one about tensors. Yes: what's the difference between non-negative matrix factorization and tensor factorization?

To my knowledge, there is a paper that came out from a collaborator that did non-negative tensor factorization, so I'll explain it in that context. Instead of non-negative matrix factorization, which you saw applied to a 280-by-time matrix, they essentially treated the data as a tensor instead of a matrix; I'd have to look back carefully at how they arranged the 280 environments over time. They added a third dimension: instead of a two-dimensional object they had three dimensions, put it into a tensor, and were able to represent different iterations of their data. I think they ran over different seeds, and they used this tensor factorization technique to define a metric for which run best described the chemistry scheme, say for three components. Remember, we set our random seed to zero, but you would obviously get a different answer if you set the seed to one or to a thousand. So they were able to use tensor factorization with the third dimension being the different seeds, and in doing so do the decomposition and extract the best out of their approach. I think that's one way of using tensor factorization over matrix factorization, for instance if your data is sensitive to the seed for some reason. To my knowledge that's how tensor factorization works, and I know there are other algorithms that do tensor factorization; again, you can look things up on scikit-learn, and I believe MATLAB has its own modules you can import for matrix decomposition. But to my knowledge, that's how those collaborators used tensor factorization.

Here's a quick question: under what circumstances is PCA preferred over non-negative matrix factorization?
I mean, for most data that has both positive and negative values, where a negative value for a feature is meaningful, and you just want a decomposition whose outputs are orthogonal, or linearly independent from one another, then you should use PCA; it's a pretty strong technique. NMF just requires that the data be non-negative, and it does the decomposition under that constraint. I believe there are further approaches out there that actually combine both PCA and NMF. NMF by itself is not completely orthogonal; there is some correlation between the vectors, so you can't say that component one is completely independent from component two when you do NMF. There are people who have combined PCA and NMF; I think they do NMF first and then apply the PCA orthogonality constraint on top of the NMF result. So there are things out there that can do both, they just haven't been implemented here. But I would say the strength of PCA is that it ensures your components are orthogonal to one another, so there's no correlation between them.

Okay, I think that was it. So again, thanks everyone for coming. If you can, let's thank Michael again for this lecture, and thanks for joining us for the hands-on tutorials for machine learning. It's been a pleasure. Thank you.