Okay, hello everyone. Hello one more time. So today we have Marcin Kosinski, a data scientist from Poland working for a company called Gradient, and today and tomorrow he's teaching at our summer school, teaching the class called Survival Analysis. But now the topic is different. Marcin is also a founder of the Why R? Foundation. Most of you who are working with Python always have this question for us: why R, why R, why R? So Marcin is the founder of that foundation, the Why R? Foundation, and he's going to talk about segmentation without distances, with the help of NMF.

Thanks, Haber, for the kind introduction. Thank you, guys, for joining. I know it's not easy to keep your concentration for one more session after a full day of work, or after the school that maybe you even attended today. So my topic today is segmentation without distances. That title seems to be more sexy than the one I previously invented; Haber said that one wasn't quite interesting and no one was gonna come. So that's just for marketing purposes, but the real topic is this stuff. I'm gonna talk about NMF, which is non-negative matrix factorization. I know the name sounds disgusting, but it's very, very beneficial, and it's a tool that I will use in a segmentation scenario. The segmentation scenario has that complexity that it requires working with a high-dimensional feature space. And I know that high-dimensional means something different to everyone. I know that biologists work with a lot of features; they have 20,000 biomarkers that they work with on a daily basis, and that's typical for them. But in the market research area where I work, where in most cases we verify the mindset of people with survey opinions, we work with 50 questions, and that's high, that's a lot. During the survey, when respondents are filling it in, they lose concentration, so 50 questions is too much; and in the end, when we provide the solution, the reported description of what we've just created, 50 questions is too much for the CEO checking our reports. So I'm going to describe how we've used this decomposition of the matrix, for the surveys that we provide in our company, on a real-life use case that we've done for the Facebook Foundation, where we detected the teachers' and parents' mindsets on the education system.

And who am I? Yeah, I'm an R enthusiast, I organize an R users group in Poland, I was an active R blogger, and my goal for this year is the Why R? conference, to which I would like to strongly invite you; at the end I've got some materials. On a daily basis I work at Gradient Metrics, for those who don't know it yet: we work in the market research area and we try to apply data science techniques. We've got a global setup: there are four people working in the company and we've got offices in four countries, which means everyone works in a different country. We are working across a few time zones, so that's not always the easiest, and I'm really always very happy to talk to real people that can actually interact with me. So that's also a pleasant moment for me, not just sharing the knowledge but also meeting new people.

So let's get back to the segmentation, and to the definition, so that we're on the same page. I will talk about market segmentation, and there is an assumption that on the market there's a group of customers, an audience of some people, like the audience of teachers or parents whose mindset you would like to verify with a survey.
There is an assumption that the people who behave in a certain way as customers, or answer the questions as respondents in the survey, follow similar patterns, so that you can find groups of patterns in the final survey, and that's how you would like to describe the population: finding groups with similar patterns that actually summarize the whole thing that is happening in the specific market or for the specific audience. I've got examples here; those are surveys. You can segment (that's the same term as clustering) the customers of an online retailer; you can do a segmentation for brand awareness, where you just verify whether people know certain brands; you can also do a segmentation of the clients of a partner.

Okay, why is it cool, why is it helpful, why is it sexy? It answers many questions. For example: how many groups are there in the market? What are their sizes? And what are the specific characteristics that differentiate segments? You can imagine there's a survey or a database with a lot of features, and you've got those group assignments, but you'd like to know what the differences between the groups are; you'd like to know the drivers behind the differences in those groups. And what's also possible: you can assign future customers or future respondents to the segments based on the segmentation. So you might survey a thousand people, and the moment you've got answers to the same questions from new people, you can assign them to segments. But that's typically the hard case, since you'd have to run the same survey for those people. During the segmentation, though, you can extract the most driving variables that differentiate the groups, and having the top five or top ten statements, you can use those for the future assignments. So there is also a plus: you make one huge survey, and then you take the most driving variables into the future survey, so that you can map new surveys to the previous groups. That's the coolest part I have learned about so far. And segmentation can answer the question of how you can improve your marketing methods. That's really cool, and a lot of customers ask about this stuff.

There are some challenges in segmentation. There are challenges related to the data, since you can work with different types of data; that's my scenario, I work with survey data, and I have my own challenges. As for the methods that are used, it's unsupervised learning, so the groups are not defined at the beginning; it's not like prediction, where you have train, test, and labels. Often it's a mixture of very various types of data: categorical, multi-select (since during a survey you can select a few answers for one question), and there's also the case of ordinal data, which is like: do you agree with the statement on a scale from one to five, where five is strongly agree and one is strongly disagree? So that's ordinal data. Another challenge is the extremely huge feature space, huge for the survey scenario at least, but you can apply those techniques to different kinds of data as well; it's not only for surveys, and the NMF that I will describe also helps segment people in different markets, not only surveys. And the biggest challenge is actually to have a meaningful story: proper sizes, so you can't have one big segment plus small, practically nonexistent ones; the segments have to have a description, how they are different from the other segments; and they've got to play the story all together.
So those are the challenges, and this is how you typically observe the data. When verifying the mindset of teachers or parents on the education system, you ask typical statements: do you think the school is responsible for the education, or do you think the parents are responsible for it? Teachers and parents, when asked that, actually disagree. We've got some demographic variables (the race, the age, the gender), and it's not just that you survey everyone; it's more targeted. If you'd like to make inferences about a population, say the whole United States population, then you also need to gather the data in the same format. If there is a specific ratio of females to males in the population, you would like to have that ratio in your data as well, and you would like to observe the same for race (a specific share of white people, of Asian people), and the same for age buckets. So that's not that straightforward, but the panel providers that you use to collect the sample support this process.

Okay, and in a typical segmentation (after the presentation we've got a workshop where we can walk through it: we see the data, apply traditional clustering techniques, and also apply the NMF strategy), what you do is you try to get a distance, a metric that stands for similarity between respondents, and then you get the distance matrix that presents the distance between observations: the distance between the first and the second respondent, the first and the third, and you see that the first one is closer to the second one than to the third one. Based on that stuff, which is put on a heatmap, you would like to create the segmentation. Most of the techniques work once the distance matrix is specified. And why aren't they applicable most of the time? Because there are so many distances that you could pick from, and they have such different properties and are designed for such specific cases, that when you have a lot of features, some of the metrics don't work; they lose their properties. Euclidean distance works in 3D, in 4D, but not in 50D; 50 variables is too much for the Euclidean distance. Some of the distances are just meant for categorical data; what if we have a mixture? And there's often a requirement for feature selection: you've got 50 questions, but you would like to segment people and at the same time get the feature extraction, the main drivers. You don't need the noisy variables. And also feature grouping: if features share similar information, you would like to have some indication that a few features are carrying the same information. So that's why we are not using the traditional clustering techniques, which we will also cover after the presentation, but something different.

Okay, and now we are here, at non-negative matrix factorization. And I'm really glad I can speak in English, because in Polish the translation is really awkward. I don't know, is there any Armenian translation? We can try. Maybe, but enough: I had "brainstorm" translated yesterday, and it was weird. In Armenian, right? In Armenian, right. How was it translated? Matagro. What's that? Matagro. Okay, let's keep it in English; even for Polish I like to keep the English versions. All right. So yeah, that was a small joke. Let's get back to the topic.
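To make that traditional pipeline concrete, here is a minimal sketch of the distance-matrix approach; this is not the workshop code, and `svy_num` is my placeholder for the survey recoded to numbers:

```r
# Minimal sketch of the traditional, distance-based approach.
# Assumes `svy_num`: respondents in rows, questions in columns, numeric codes.
d <- dist(svy_num, method = "euclidean")  # respondent-to-respondent distances
heatmap(as.matrix(d))                     # the distance matrix as a heatmap

hc  <- hclust(d, method = "ward.D2")      # hierarchical clustering on the distances
seg <- cutree(hc, k = 4)                  # cut the tree into, say, four segments
table(seg)                                # segment sizes
```

The choice of metric is exactly the weak point discussed above: with 50 mixed-type questions, the Euclidean distances stop behaving.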
So NMF stands for a whole family of algorithms, and I'm gonna describe one version of the whole setup that you could use. You start with the data, in the tidy version, which has respondents as rows and answers to the statements as columns. What you would like to achieve is a decomposition into two other matrices. And what's really important: one matrix has the same number of rows as the original one, and those stand for the respondents, and here we've got columns that stand for the features. And there is a hidden dimension called the rank, the factorization rank, and it also stands for the number of groups. What you would like to have from this decomposition, and what is the reasoning and the biggest benefit, is that this first matrix stands for the segmentation: you've got people and their assignments to the segments, and you pick the biggest one. Okay, if the numbers are zero and one, you take the second segment; if it's five and three, you take the first segment. So that's easy. And what's really cool about this stuff is that you've got the same for the features: you've got the features and their loadings on this hidden dimension. So for the first statement, call it the first feature, you also get numbers, and a higher value indicates that this feature is highly visible in that group. It means that in that group people are actually over- or under-indexing on this question, which means they strongly disagree in comparison to the rest of the population, or they strongly agree. So this matrix gives the segment assignments, and this one gives us which features describe which groups. That's two birds with one stone: we've got the segment assignments and we've got the description. During the workshop we're gonna see what those look like.

And there is one more requirement put on this decomposition. It's not just any decomposition; there is a requirement that those matrices should be sparse, and that means: the more zeros, the better. The more zeros the better, because you've got direct assignments: there are many zeros and just a few numbers, so you know that this person is assigned to this group and this feature explains this specific group. Okay. Do you think this might work? Anyone convinced? Okay, I'm convinced this works.

So your features are only the statements? Yeah, in this example they are the statements. You could also add the demographic characteristics, but those are categorical data; you should code them as dummy variables. And you could also use continuous features, but they've got to be scaled to the same scale as the other ones, so that there is no influence of the scale on the factorization. So for numeric features, you can also learn some segmentation? You could use all the features; I am using the statements here for explanatory reasons. But you could also have, I don't know, if there was a registered income for a person, you could use the income. I have a couple more questions. So in the end, are you going to have some interpretation of the segments, or are you just going to use those two matrices? Yeah, we're gonna have the segment assignments and the segment descriptions by the discovered variables, and maybe I can share some story of what was created.
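Here is a minimal sketch of that idea on toy data; this is my own example, not the slides' code, using the NMF package, and the segment assignment is just the position of the biggest value in each respondent's row of the basis matrix:

```r
library(NMF)  # CRAN package for non-negative matrix factorization

# Toy respondents-by-statements matrix with answers on a 1-5 scale.
set.seed(1)
V <- matrix(sample(1:5, 20 * 6, replace = TRUE), nrow = 20,
            dimnames = list(paste0("resp", 1:20), paste0("stmt", 1:6)))

res <- nmf(V, rank = 2, method = "snmf/r", seed = 123456)  # a sparse variant

W <- basis(res)  # respondents x rank: segment scores per person
H <- coef(res)   # rank x statements: feature loadings per segment

segments <- apply(W, 1, which.max)  # pick the biggest value in each row
table(segments)                     # segment sizes
```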
Can I have one question? Yeah. So in terms of the output, just the output, how different is this from the traditional, I don't know, factor analysis or principal component analysis? Okay, yeah, that's an offensive question. So, principal component analysis, which reduces the dimension of the characteristics, shows you the data in new dimensions, where the size is smaller but the full information from the original data is still covered in those new dimensions as their linear combinations. And what are we getting here in comparison? And the other option, factor analysis, that's almost the same. Here you can have zeros for the first segment here, and zeros here and numbers here, so you stay with the same set of features, and you know that these explain this segment and those explain that one. Whereas in PCA there is this hard linear combination you'd have to get back through. Okay, you are convinced? Yeah. Thank you. And you are convinced? That it's gonna work? Okay, that's three of us. That's good. Looking forward.

Okay, one math slide that I have. If you are an R programmer, you might be interested in the only R snippet in this presentation; it's really convenient. You run the nmf function from the NMF package. You provide the data, you input the rank (and watch out, I haven't yet said how to determine the rank; there are procedures to extract it), then you specify the method, because it's a group of algorithms, and you specify the seed. So what are the methods, what are the seeds, what's the purpose? At the beginning we would like to find two matrices that minimize a cost function: we would like the multiplication of the two decomposed matrices to be as close as possible to the original one, and we've also got a special regularization part that puts the sparseness requirement on those two matrices. (Is that noise only in my head, or is it happening? It's reacting to your voice, I guess. I think it's good now.) So the product is never going to be very close, because there is regularization, and it's never going to be very sparse, because it's got to be close; there is some compromise. The decomposition has to balance both requirements. The method actually stands for how you optimize the function, what techniques are used, whether that's gradient descent or one or another; I'm not that smart to know all the details. And there is also an option to specify the distance between those matrices; for example, the default is some kind of Frobenius distance. And what is the seed for? You start with those matrices and you fill them with randomly selected values from the original data, and then you have the starting point; you've got the method for the optimization, and you stop the moment you reach a minimum or you hit the number of iterations that you required. So that's a random starting point, which means you might end up with some ridiculous solutions. What you typically do is repeat the stuff 30 times, and in the end you take means over those decompositions. And it might happen that the segments are switched because of the random start, so you've got to reorder them so that they follow the same pattern: segment one in this decomposition is the same segment one in that decomposition, based on the features and the rows they specify. And then you can take the mean. So that was the hardest part; we've got it covered.
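To put that cost function in symbols, a hedged formulation, since the exact penalty depends on the method you pick; this is roughly the sparse variant of Kim and Park that the `snmf/r` method implements:

$$
\min_{W \ge 0,\; H \ge 0} \;\; \tfrac{1}{2}\Big( \lVert V - WH \rVert_F^2 \;+\; \eta\,\lVert W \rVert_F^2 \;+\; \beta \sum_{j} \lVert H_{\cdot j} \rVert_1^2 \Big)
$$

Here $V$ (respondents × statements) is approximated by $W$ (respondents × rank) times $H$ (rank × statements); the first term keeps $WH$ close to $V$, and the penalty terms push the factors toward sparsity, which is exactly the compromise mentioned above.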
And now the use case; okay, so this was actually used in our research. So, how do the factors stay non-negative? With gradient descent it's possible to reach negative numbers as well, so is it cut at zero, or do you know? It should be. It should. Well, okay. Never thought about that, that it can go that far: you've got all positive values in those matrices, and somehow in the estimation you reach a negative value or zero; you can get to negative numbers with gradient descent. Okay. Okay. And one more question: you said you average out the results of many rounds; how do you handle the permutations? Yeah, so as I just described: you can have, say, rank two, and the factors permuted; you can use column sums and row sums to determine the order. I just want to understand. Yeah, that's silly, but that's one of the solutions. And you do it in the two-factor case, because if you do it for more factors, there are more permutations? Yeah, there are more permutations. So based on that you can get the order. For example, you've got one matrix for which you've got the column sums, and you've got another one with its column sums; there's an order in one of them, and maybe the other should be reordered to follow that order. That's the order for two; for more, we do the same, so check all the permutations. The software does it for you; you don't do that yourself. Okay, there was one more. Yeah, I was wondering whether that averaging is really important; couldn't you just run the gradient descent for more iterations and then check that your loss over iterations doesn't suffer? You know, it's the random starting point; it's always better to run it multiple times.

Okay, the real use case. There was a survey that we made, around 50 statements. Find a reasonable number of groups for the teachers' population, with reasonable sizes (they can't be very small, can't be very big) and with a meaningful description: there have to be features that are visible in one group and not in the other groups. And it's good if it's built on as small a number of features as possible: if in the final summary you can extract eight features that differentiate the groups, that's awesome. And what's the approach? For finding the number of groups, we haven't talked about it yet: you start those 30 runs for Ks that can be from 2 to 10, I mean the rank, and for every K, two, three, four, five, six... So K is the number of clusters? K is the number of clusters, the rank, yeah. Thanks. You can then check the statistical goodness of fit of the segmentation and the decompositions. There's the dispersion, which stands for the reproducibility of the decomposition; the silhouette, a statistical measure that verifies how well a certain observation fits its segment in comparison to the closest other one, and you can take the mean over all observations; and the sparseness of the decomposition. That's how you can ensure that picking the right K was actually motivated by some statistical rules, but there are also business requirements: it can't be greater than 10, and it rather shouldn't be three, because that's too small. And what also helps you determine the proper segmentation and the right K is something called consensus clustering. Across those runs you can find out whether the features are pointing to the same segment, whether they land in the same segment, and based on the thirty runs you can count the co-occurrences: how many times the same features were together in the same segment.
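A hedged sketch of that rank-selection step, assuming the survey is already a non-negative matrix `V` (respondents × statements); the object names are mine:

```r
library(NMF)

# 30 random restarts for every candidate rank from 2 to 10.
estim <- nmf(V, rank = 2:10, nrun = 30, seed = 123456)

plot(estim)          # quality measures per rank: cophenetic/dispersion, silhouette, sparseness
consensusmap(estim)  # per-rank consensus matrices: do items co-cluster stably across runs?
```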
And based on those you can then plot the consensus matrix (it's going to be more visible when I show you the plot), and based on it you know whether, across random runs, the features are pointing to the same group or not. If they point to the same group, that's good; if they point to random groups, not always the same, that's bad.

Yeah, so those are the goodness-of-fit plots. We've got the measure that stands for the reproducibility, for the best fit within the runs, the one that's closest in terms of the minimization. We've also got the plots for the sparseness, and we've got two curves: the green one is for variables, the red one is for respondents. The same here for the silhouette: the greater the better, the greater the better. So where you lose on the statistical measures, you gain on the sparseness, and it's also getting better in terms of stability when K is greater than five or six. And there's also the consensus clustering, which has the same reasoning as the silhouette: it stands for the proper assignment to a specific group. Based on those plots, which K would you recommend to the client? Five. Five, okay; any other options? Five is not that bad, but it has really weak stability. And you don't see the most crucial part here, which is the segment sizes: it might be that for the solution equal to five there is one huge segment and four small ones, which is not the best. And here, in this situation, we in the end picked seven, since one segment was really small and we rejected that segment, and two others had the same interpretation (they had the same variables describing them), so we could group them. So based on the actual K equal to seven, we ended up with five segments. So that's not the full story: you've also got to look at the sizes and the values that come from the decomposition. That was tricky; it's really hard to guess just from those plots.

And the consensus clustering that I promised to show: once again, this is for K equal to four and K equal to six. Please remember the process: 30 runs, so in the end 30 assignments of features to certain groups, and based on the coherence, whether they were in the same group, we can build the co-occurrence matrix, which is a distance matrix, and we build hierarchical clustering on it. What is displayed here: those are the features, here and here, and the middle bar presents the consensus clustering groups; here it's not that visible, and you see high co-occurrence within those features that are in one segment, and smaller values, which means they do not co-occur with those from the different segments. There's also the bar that presents the silhouette, which is quite high, for a single feature. And it's not that good for K equal to six, since those here are two segments, one and another, and they co-occur across, and they do not strongly co-occur inside; so based on that, in this scenario (I don't remember whether it's for teachers or for parents), I would pick four instead of six. So based on that plot, the sizes, the description, and the consensus matrix, you are able to determine the K. It's not that simple; I needed to look at it for a week or two to actually grab the idea, but the slides will be available, so there's a place to start.
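If you prefer the numbers behind those plots, a hedged sketch, reusing the `estim` object from the multi-rank run sketched earlier:

```r
head(estim$measures)  # one row per rank: cophenetic, dispersion, silhouette, sparseness, ...

# The statistics are not the full story: always check the segment sizes too.
fit7 <- nmf(V, rank = 7, nrun = 30, seed = 123456)
table(apply(basis(fit7), 1, which.max))  # how big each of the seven segments is
```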
Okay, so there's this whole machinery: we do the survey, clients influence the survey structure and the questions, then we get the data, and then we do the segmentation. And in the end, what's reported is actually such small tables, where we present the methodology that we've used for the solutions with four, five, and six segments, and we report the silhouette scores (I will talk about them more in a bit), the dispersion, and, what's crucial, the largest segment (for one segment in the four-segment solution, it's not that big) and the smallest one (it's not that small if there are six segments). And also the best solution for the other techniques, with the silhouette score and the number of segments for which the silhouette score was the highest. So we see that this one is too small and the silhouette scores are also small, and there is a huge difference in the silhouette score between the non-negative matrix factorization approach and the rest. Though, to be honest, it's not just the NMF; it's a bit mixed with factor analysis, so it was kind of a level higher, more complicated. But you can think about it like that: those are the differences between the typical approach and the approach where you are digging and digging and in the end you find the solution. More people are convinced right now; some say yeah. Is NMF used mostly for very high-dimensional data, like many rows, or can it be used for small data as well? I think every technique can be used on any type of data if it brings a good solution. Based on those scores, it doesn't look like any of those solutions was good; and whether there is a special requirement, like for CLARA, that should hold, I don't know; I have no idea if there are any special numbers that should be the minimum.

And then we also present the story; it's not only numbers. We've got the segment and its size; this one is the "follow the guidebook" teacher. Based on the cell values in the columns of the decomposition, we know which features to look at, and we can create the story: these are the over-indexed variables, these are the under-indexed variables, and we report the percentage of people that agree with the statements, and also of those that agree with the under-indexed statements; these are the ones not visible in this segment in comparison to the full population, and these are highly visible. And there is a number that means: you take the proportion of people in the segment that agree with this statement over the proportion of people that agree in the whole population; the baseline is 100, and if the ratio is 300, it's three times more likely in this segment to agree with the statement in comparison to the full population. So that's the description of the segment, and it's the "follow the guidebook" teacher because they reported that they think students should be treated equally and education should not be, I would call it, personalized. That's why it's the "follow the guidebook" teacher in the story: follow the same patterns over and over, and no improvement in the education system.
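That over-index number is simple arithmetic; a tiny sketch with made-up proportions:

```r
pct_agree_segment    <- 0.60  # share agreeing with a statement inside the segment
pct_agree_population <- 0.20  # share agreeing in the full population

index <- 100 * pct_agree_segment / pct_agree_population  # baseline is 100
index  # 300 -> the segment is three times more likely to agree
```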
Getting to the end notes. Yeah, so NMF has a random starting point, so it's better to redo it a couple of times. The rank selection is the crucial part; you've seen that it was really not that easy to determine whether it should be five or six segments. In the end we show the solutions for a few Ks, and then the client decides what brings the best story. And the feature grouping is provided: you're gonna see the heatmaps in the workshop, and you're gonna see the groups on those heatmaps. Like with the factorization we had: we've got those values, we can build the heatmap based on them, and we can also cluster them, and we can see groups; the features fall into specific groups, since they are describing the same segments.

And the credits: it's not our solution, it's a CRAN package, and we are just the users, not the software developers; all the credit goes to the authors. And if you liked the talk, I strongly invite you to join the Why R? conference that is happening this September in Warsaw. I hope you enjoyed the talk, since it was part of our marketing trip, and I'm really glad I'm here in Armenia, for the invitation that Haber sent me. If you would like to join the conference and listen to similar talks, we can talk about how to get to Warsaw in the cheapest way, or how to get a discount for the conference. And yeah, thanks for listening to me; I know it wasn't easy. If you have any questions, I'm free to answer, and I hope we can also go for a beer and talk more about this stuff.

Okay, one there, one over there; yeah, we can go first. Yeah, so I was wondering whether you have, on the contrary, a case where you have other constraints on the segments (I'm thinking of a segment as a cluster), constraints such as: I don't want the age, for instance, to vary too much within a cluster, or I want the gender to vary, so I don't want to have boys in one group and girls in the other group. Do you know how to deal with this kind of stuff? I've never done it so far for this case, but I think (and that's just my guess) that the scale of the variable is the influential factor for the factorization; I'm just guessing. You probably could stick it into the loss function somehow, or... yeah, a regularization objective, or you could... yeah, that's better than mine.

Okay, so we can now get to the workshop part. Guys, if you would like to redo the analysis and see the actual decomposition and see the heatmaps, there is a special GitHub gist where you can download the materials. For the Wi-Fi: the network is called "AUA Guest" and the password is "aua guest three five seven". So in the next, I think, 45 to 60 minutes, depending on how fast we are, we're gonna just run the code to redo the analysis. Can you post this link to the Facebook event page? I could; I'll post it. I'm posting the link on the event page, yeah. Just try to find it like that. Okay, and once you are on the page, depending on how familiar you are with gists: if not, there's the green Download button, and you can download it. (That's a different thing; that's different from the classes we had in the morning.) I hope I posted it on Facebook. And here is the same link that you put on Facebook, right? I hope, yeah, I copied it. Oh, I put it into the R users group instead of... yeah, it's like that: you can't just post on the event page, you need to get some approval, but I think it's fine as long as people can find it. Then you click Download, and when you unzip it, there should be a file called nmf workshop inside, and if you double-click the RStudio project, that kind of setup should open: here is the project name, all the materials that are needed, and here we will do the coding. Who's ready with the setup? That's crucial, and then I'm gonna speed up. All good? Everything? I must see that we've got the script; ensuring everyone got the code.
Those of you who are ready can go to scripts, and there is a script called read data. At the beginning you can install the dependencies, if you think you don't have those packages, and then you can load them with library(), library(). There are a few techniques to run the code: the moment your cursor is on a line you can just press Ctrl+Enter, or you can select the full line and press Run at the top, and when the full line is selected you can still press Ctrl+Enter. What was the survey about? The mindset, the mindset on education. Okay, so I'm going to start slow so that everyone can follow. The first lines read the survey, and the next few lines just create the dictionary for the column names and the column labels, since, you can check, it's not a regular data frame. Most of this data is collected as SPSS files, which means the columns have names and labels, and the moment you go to the View, when you click on the data name, you see that there is a question and there is an extra label, "our education system is supposed to instill a common understanding...", statements, stuff like that, and the answers are coded: one is disagree, five is agree. And we can also create the dictionary in the next few lines; by dictionary I just mean presenting the head of the set, so that we've got the column name and the actual label that was displayed to the respondent. And the survey data, the first five rows and columns: we see that the class is labelled double, because these are actually numbers that have labels; we see that five stands for agree, one for disagree. And we can check just one column to see that this special type of data, called labelled, has an extra attribute, labels, that shows you the dictionary for the labels. Okay, that's how we read the data.

And we can run a few simple statistics; that's still the same script, guys. The simple statistics give us the mean for each question, the median, and so on, and the same with the dictionary. So we can see that statement number 13, "I want my child to be the best, whether or not she/he cares about being at the top of the class", is a statement that these people (this one is for parents) actually agree with. And they disagree with the statement that report cards should compare students against each other, that education shouldn't be personalized; so parents disagree: it should be personalized. All right, so that's just the first script, to read the data. Have you managed to read the data? I think so, yeah; I can see your faces.

So, two more scripts, guys. The next script, 02, traditional segmentation. We will use the clValid package, which does all the stuff for us. You can install the package if you don't have it, and you can load it if it's not loaded. We set the random seed, since a few of those algorithms have random starting points, and we run the clustering for the survey data, the moment we change labels to numerics, for K in the range 3 to 9, on all the parents. The methods that we will use are the typical ones: hierarchical clustering, k-means, partitioning around medoids, and one that I don't really know. All righty. So what it does: it runs those algorithms for the various Ks, from 3 to 9, and then it summarizes the fit.
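A hedged sketch of that clValid step; `svy_num` is again my placeholder for the survey recoded to numeric:

```r
library(clValid)

set.seed(456)  # a few of the methods have random starting points
cv <- clValid(as.matrix(svy_num), nClust = 3:9,
              clMethods = c("hierarchical", "kmeans", "pam"),
              validation = "internal")  # connectivity, Dunn index, silhouette

summary(cv)        # all measures for every method and K
optimalScores(cv)  # the best method/K per measure
plot(cv)           # the same measures as plots

hc <- clusters(cv, "hierarchical")  # the underlying hclust object
table(cutree(hc, k = 3))            # segment assignments and sizes for K = 3
```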
We can see a few statistical measures, like the connectivity, the Dunn index, and the silhouette. And then the interpretation: the Dunn index and the silhouette should be as big as possible, and the connectivity should be as small as possible. For all those methods on the left, and in columns we've got the Ks, the number of groups, with all the statistics reported. It also gives us the optimal scores; the best solution here is hierarchical clustering with three clusters, but not by much, and the silhouette score is kind of poor. You can just get the optimal scores if you don't care about anything else, and you can also see the plots, the same metrics on plots: the connectivity for the various methods and various Ks, the Dunn index, and the silhouette. And then there is also a line (I'm speeding up because the last script is the most important): the last line tells you how to extract the segment assignments. You can always get the segment assignments for those different methods: from the object we can extract the parts for the cluster objects, and for the four methods we can extract the clustering assignments for K equal to 3, 4, ..., 9. So we know how to do the traditional clustering, and now we will get to the fun part, which is the NMF segmentation. I'm going to slow down a bit here. We load the packages, and then we prepare the data: it should be a matrix with integers as the starting point, so I'm preparing the survey as a matrix.

All good, guys? Maybe you can just follow what the others are doing once you finish the first script; okay, then you might go to the third one, the NMF segmentation, and I'll stop and I'll wait for you. How far are you, guys? What's your line number? Just make sure you're okay. All good? If it asks you to install something, just hit yes, please, just hit yes. Just have a moment. I think it doesn't work because you haven't unzipped it; the archive still needs to be extracted. Yeah, go one level higher and try to extract this one.
Yeah, that path only stands for this project and the data there. Okay, so I'm going to set the path and try again. Looks like it's all right now, yeah? You just run the script and we are starting with the NMF. It's all right; you can go to the NMF script. It looks like you guys are good; yeah, you're set up. And I think you need the NMF package as well, or you've done it already? We have done it already, okay. Try running the path, just the path, and then you will know; you've got everything you need. And then the next one. Okay, no problem. We've got to have one more package here, library(), library(); so there's one more package, guys. In the previous script we have already installed it; ah, it's included; oh yeah, we did it in the first one. Okay. I hope we are all there; 20 more minutes and we're heading for it.

Okay, so I'm getting the survey data in the matrix format; I'm at line six here. And then normally you would run line 13 to create those 30 runs for the Ks from three to nine, but it takes some time, so I've run it before, I stored the model, and you can just read the model; that's five minutes less. So we read the data, read the model. You already have the model saved because it takes too much time? Not too much, like five to ten minutes, but that's always a lot; you know, people can get bored in five minutes and I don't have that many jokes. So we've got this full model that has those 30 runs for the various Ks, and we can plot some stuff. The moment you run plot on this object and you specify what you would like to have plotted, you get this dispersion stuff, the sil... oh, sorry. Can you explain this plot in more detail once more? Okay. What is the best fit; why is the dispersion only calculated for the best fit, and then for the others...? That's a good question. Within the 30 runs there is one that is the closest, where the function we minimize reaches its minimum, and that's the best fit. Okay. And the basis is the matrix for respondents, the coefficients is the matrix for variables; that's why we have the sparseness for those two, because we decompose the matrix into two new ones and calculate the sparseness for each part of the solution. Red is for the respondents, green is for the variables, and the pink one is for the consensus that we saw, the co-occurrence of variables. So is higher dispersion better, or should it be smaller? Higher is better. And the silhouette? Higher is better. What about the sparseness? Also higher is better. So you've got conflicting information here and here, and you would also like to have a solution around four to six, and that's not the best here. So that's always a compromise; there is no golden rule. You've got to check the sizes, you've got to check the different plots.
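A hedged sketch of this run-or-load step; the file name and object names are my assumptions, not the workshop's:

```r
library(NMF)

V <- as.matrix(sapply(survey, as.integer))  # the survey as an integer matrix

model_path <- "models/nmf_runs.rds"  # hypothetical cache location
if (file.exists(model_path)) {
  estim <- readRDS(model_path)  # reuse the stored 30-run fit
} else {
  estim <- nmf(V, rank = 3:9, nrun = 30, seed = 123456)  # the 5-10 minute part
  saveRDS(estim, model_path)
}

plot(estim)  # dispersion/cophenetic, silhouettes, sparseness per rank
```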
So let's maybe have a look at some more plots. The moment you run this loop (it goes over trimmed solutions, so we don't look at all of them), it goes for solutions K equal to three, five, and seven, and for each of those it creates a few plots: the consensus map that I showed during the presentation, the heatmap for the respondents' decomposition, and the coef map, which is the heatmap for the variables' decomposition. So the moment you run this code, you can see some plots in the plot directory. So let's open the first one: the basis map for three, rows not reordered. Okay, so you remember the decomposition: we decomposed into the respondents part and the variables part, and this is the matrix for respondents, normalized so that the rows sum to one. We can see that this respondent has a high value in the third segment, and this one has high values in the first segment; those are in the middle one. And we can also order that stuff: we lose the order of respondents, but we can actually see the groups. So the second plot, the one with rows reordered, also uses hierarchical clustering on the rows so that we see how many groups there are; based on the reordered heatmap you can see that the groups actually exist and they are visible. It's not just random noise in this algorithm.

So what's more interesting, let's see the coef map for five. The coef map is for the features; we see the features here, the values are normalized to sum up to one in the columns, and these features have high values for the last segment, those describe the first segment, those are visible for the second, and the rest is not that important, not as outstanding as those. And we also have the consensus clustering: how the features group in the consensus. Not that good this time for this solution; if the signal is high, the features within one segment should also land in the same consensus group. And that's the clustering you have for respondents, where the groups are also represented by the columns; there's a transition between the rank for variables and the rank for respondents, and that's the segmentation. So the solution equal to five doesn't look that good. For three, we know how to interpret the heatmap for the basis segmentation. And let's see the consensus map for, let's say, seven: the consensus here is again in the middle; we see one, two, three, four, five, six, seven, and kind of no consensus here, and those rather should go together. So those solutions are not always that effective and not always that good; you need to verify, you need to have a look at that stuff.

Okay, so what can we also get from the fit? If you look at line 39, you can check the structure of the object. The object has a special field called fit, where it stores those matrices. So the moment you run line 39, you know that you can get those matrices for respondents and for columns from this run; that's line 40, and you get the segment assignments for respondents: the first one should rather be assigned to group two, but it's close to group three; the next one should be assigned to group one, and so on; the third one to group four. And we can have the same for the columns.
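A hedged sketch of those plots and extractions, assuming `estim` holds the multi-rank fit from before; in the NMF package the per-rank fits are indexed by the rank as a character key:

```r
fit3 <- fit(estim$fit[["3"]])  # best-of-30 model for rank 3

basismap(fit3)                  # heatmap of the respondents matrix W
coefmap(fit3)                   # heatmap of the variables matrix H
consensusmap(estim$fit[["3"]])  # co-clustering stability across the 30 runs

W <- basis(fit3)   # respondents x 3
H <- coef(fit3)    # 3 x statements
round(head(W), 2)  # the biggest value in a row gives the respondent's segment
```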
Sorry, sorry; so we're looking at the respondents' groups, and is K equal to two or three? This is for K equal to three this time; the two is just the rounding, it's rounded to two digits after the decimal point. So what does this mean, 1.86, like in the first line? Those are just unexplainable numbers; they just show you the magnitude. So how do you know which one belongs to which group? You pick the biggest; pick the biggest, okay. And on the heatmaps you saw the normalized versions, so you can also treat it like probability assignments. Okay.

And the breakdowns, the next line, 45: what are the sizes of the groups? We see, at line 45, the sizes for the segmentation equal to three; then split for the segmentation equal to five, okay, we see one small segment; and the segmentation equal to seven. Yeah, and there was one group that we removed and two segments that we joined, so that's fine, and the sizes are good: there is no huge one and no small one, but actually we removed one. In the next line you can get the predictions, which are the assignments for respondents; that's the head, so the first six people are assigned to segments 5, 4, 5, 5, 1, 1, and you can get the sizes.

And what I'm also going to present is based on some helper functions. If you run those next lines, from line 60 till 72 (it might take some time), I'm calculating this index to verify which variables are visible in which segments, to compare whether the NMF already told me this story. So for K equal to three, five, and seven, I'm calculating question breakdowns: how many people disagree, how many people have no opinion, and how many people agree. Let's see, for K equal to three, what I meant. Okay, so there's one question, and in segment one, thirty-three people have no opinion, thirty-one people disagree, and thirty-five agree; and in the whole population, not divided into segments, forty-five percent of people agree. And based on the proportion of people that agree compared to the population, we are able to determine that this is a visible variable that could describe segment two in the solution equal to three. And that's what you would do if you just had the segment assignments but didn't have the story, didn't have the features already pointing to segments: you can just go and calculate this index.

So we've got a function that calculates the index; let's use this function. Line 77 gives us such a story: the top five over-indexed variables for segment one in the solution equal to three. Those top variables have high indexes, so they are more visible here than in the population, and we see the percentages of people that agree. And now we can compare this manually calculated index to what's presented by the NMF. We can look at the matrix that stands for those variables (the numbers are just rounded to two digits) and extract the variables that are over-indexed, to verify what the NMF said about segment one: and indeed, they have the biggest numbers in segment one. So that's how you get from the NMF to the variables that explain the segment. This is something like the factor loadings could be? Yeah, by interpretation you can treat it like that; in most cases, yeah, that works like the factor loadings. So those five questions that are actually the most over-indexed in this segment were already pointed out in this matrix that decomposes the variables. And there were also five under-indexed ones, statements people rather do not agree with, and those have the smallest numbers for the segment. And that's how I came to believe in the NMF segmentation: the numbers
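A hedged sketch of that manual cross-check; `svy_num`, the rank-3 fit, and the agreement cutoff are my assumptions, not the workshop's helper functions:

```r
seg <- apply(basis(fit3), 1, which.max)  # segment per respondent

agree    <- svy_num >= 4                 # treat 4-5 on the 1-5 scale as "agree"
pct_pop  <- colMeans(agree)              # % agreeing in the whole population
pct_seg1 <- colMeans(agree[seg == 1, ])  # % agreeing inside segment 1

index <- 100 * pct_seg1 / pct_pop        # baseline 100; 300 = three times more visible
head(sort(index, decreasing = TRUE), 5)  # top five over-indexed statements

# Cross-check: the same statements should carry the biggest values in
# segment 1's row of the coefficient matrix.
head(sort(coef(fit3)[1, ], decreasing = TRUE), 5)
```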
it provides for the under-indexed variables are the lowest, and for the over-indexed ones the highest, and that matches the actual index calculations. So, mindset eleven, the one where... is K equal to two? It's mindset 11, 0.89; can you explain what it means? Yeah, this one. So now we are looking at K equal to three, the matrix for variables; it's really long, this matrix for variables, so maybe I will just... okay. And why is there another argument, two; what does it mean? The two is just to round the numbers; without rounding, the numbers in this matrix have 16 digits. So 0.18, 0.19 means that, if I'm looking down the columns, the maximum that I get shows the segment that is described by the given variable? So all these variables are describing segment one? Yes, and I've picked those because I've checked that they are over-indexed. But you can do it the other way around: look at the NMF first, and then you see, okay, that matches the over-indexing, and you can redo it for the rest of the stuff and check that it holds, that the over- and under-indexing plays the same story.

And that's all I wanted to present to you, guys. Then there is a full procedure to turn the segments into a meaningful story, the report presenting the stuff; that's not the part of the work that I do in my company, but in my opinion that's a really valuable skill: to gather those results, create the story, present the report, and be convincing during the presentation. So if you ever wondered where data scientists can go further, I would tell you to invest in those, I don't know, soft skills. Yeah, I'm done. If you've got any questions... thank you.