In this video I'm going to discuss K-means clustering. We're going to use Python and, more specifically, scikit-learn as our machine learning library.

So what is the hypothetical example that we have here? Imagine that we have 99 patients who are all using a smartwatch healthcare data app. The app continuously gathers data on them for a study, but to comply with healthcare regulations they need to actively sync that data with the researchers. To remind them to do so, there's a campaign of sending out emails, short message service (SMS) messages, WhatsApp messages, pamphlets, telephone calls and long letters. So there are six types of campaign, and they were sent out throughout the year. The campaigns were numbered 1 to 32, and each campaign used only one of these types, so campaign number three, for instance, would only be a WhatsApp message.

Every time a patient responded to one of these throughout the year, that was recorded, so we know which patients responded to which campaign. The information that we really want is which campaigns certain groups of patients responded to, and we want to use K-means clustering to find that.

In this example I'm going to use five clusters. Remember that five times seven (seven is the magic number here) is 35, and that's way less than 99, so we're well within that margin and don't have too many clusters. The business intelligence here is going to tell me, for each cluster, which of the campaigns the patients responded to and which not. You can well imagine this might also work in a business sense, where there's a marketing campaign and we look at the response with respect to sales.
So let's construct a little notebook. We are here inside of a Jupyter notebook. I always run this little cascading style sheet; you can see it's style.css, and it lives in the same folder as this notebook. It's just the way it's set up: a nice little blue and orange here for h1 and h2, the text, etc.

Now I'm going to use the pandas library here just to manage my data. From that I'm only going to use the read_excel and merge functions rather than the whole of pandas. Then we're also going to use the arange function from NumPy. And then here we go, scikit-learn: we're going into the cluster and decomposition sections of scikit-learn to import KMeans and then PCA, principal component analysis. I'm perhaps going to make a video on that all on its own, but you'll see it when we use it.

Then I'm also going to use Plotly to do my graphs. It's not part of Anaconda; you have to install it separately with pip install plotly. From the graph objects in Plotly I'm going to import go; go has the scatter plots, box-and-whisker plots, etc., just to make it easy. I want offline plotting, I don't want to have to go to the Plotly servers for rendering, and for that I'm going to use the init_notebook_mode function here and then also the iplot function. Then I'm going to run init_notebook_mode so that the plots render right inside of my Jupyter notebook.

Let's look at the data that we're going to use. I'm going to create two data frames. The file is patient response.xlsx, an Excel spreadsheet file, and it has two sheets; the first one is always zero by default and then the second one is one. So I'm going to import the two sheets as two separate data frames: the first one's going to be called df_campaign and the second df_response. I'll show you what's inside of them so you can understand how they were constructed. So let's have
a look. Let's just run this cell for Plotly as well. Yeah, that's run.

So let's have a look at the campaign data frame. We see the campaign ID. I used the tail method here, so we have the last five: campaign 28 was a long letter sent out in November, an SMS was sent out in November, and in December there was an email, which was campaign 30, a WhatsApp, campaign 31, and a long letter, campaign 32.

Let's look at the responses. Here are the last five. You see 323 responses from the 99 patients; here we have patient number 99 responding to campaign 32. What I didn't show you here is that I've added a new column called n, just to count, because each of these is one instance: patient 99 responded to a campaign once, say, and I want to be able to capture that as one instance of this happening. So I just fill this column with ones.

Now I'm going to merge these two. You see the campaign ID; we can merge on that campaign ID column. I needn't have put on equals campaign ID in my merge function, because that is the only column that is identical in both of these. So let's run that and look at what our merged data frame looks like now. Very nice: campaign ID, what it was, when it was run, and the patient that responded to it, and we're just counting one.

So let's create this pivot table to count each of these 32 campaigns. I'm going to create this table data frame by pivoting: I want patient to be the index, the columns to be the campaign ID, and to count the values. What I want is for these 32 campaigns to become the top row, the column headers of my pivoted table. Let me show you what that looks like. There was campaign one, campaign two, campaign three, and then the patients down this side. When a patient did not respond to a campaign, at the moment that's going to be a NaN value, but you can see that patient 95 there responded to campaign 25, for instance, and campaign 30. Now what I want to do
is just fill these NaN values with zeros and reset the index, so that patient sits here as a column and campaign ID is the label on the index. Let's have a look. So that's campaign ID down here, and we can see the 32 campaigns, 1 to 32, and then for instance patient number 95, who we can see again responded to 25 and to 30 and not to the others.

Then I'm also just going to extract a columns index here by calling table.columns from 1, and that gives me this 1 to 32. I want to keep these separately so that I can use them later on, when we're going to use the principal components to decompose into two dimensions.

So let's use our K-means clustering. I'm going to instantiate KMeans and call it cluster, and I'm going to use five clusters; as I said, five times seven is 35, which is way less than my 99 patients, so we're safe there. Let's just see... that was not imported up here, so let's do that. There we go, that's the one that I missed. Let's just wait for that to run; we're just importing KMeans and the clustering. So let's go down and run that. There we go.

Now I'm going to fit this K-means clustering and put the result inside a new column in my table data frame. We can see here cluster.fit_predict on table, with table.columns from 2. So let's have a look at what that's done. Here we have this cluster column: it looked along the row at the campaigns that this patient responded to, and it put that patient here in cluster number three. Now, it starts counting at zero, so that's actually cluster four; this one would be cluster one and that one cluster four.

Now I'm going to do this principal component analysis because I want to reduce the dimensionality down to two, and as I said, I might make an extra video about principal component analysis, just so that we can plot this
as a two-dimensional data set in a scatter plot. So there we have it reduced into an x and a y. You can see we use the PCA fit_transform method here, and for the columns, that's why I kept my 1 to 32, so that's going to be reduced down to an x and a y coordinate.

I'm just going to create a new data frame, because from table I only want patient, cluster, x and y. So you can see we now have patient, cluster, x and y. I'm going to merge that with the response data frame and call it final, and this final I'm going to merge with the campaign data frame, so that I have this all neatly in a row: campaign 31 was a WhatsApp in December, patient 64 responded to it, that's the count of one, the patient fell into this 3, which is cluster four, and it has an x and a y component.

Why? Because I'm going to use Plotly to do all these scatter plots. Now, Plotly works like this: every individual thing on the figure itself is a trace, so I'll just call them trace0 to trace4. It's go.Scatter, so it's a scatter plot, which takes an x and a y argument. The x is going to be taken from this x column of the patient cluster data frame where cluster equals zero, and where cluster equals zero I take the y value as well, so that populates my x and y. The name for trace0 is then cluster one, and you can see I'm going to color it and put a line around it. I do that for all of them, then I make a list of all these traces as data, and I'm going to plot the data. So let's go. There we go: a beautiful interactive plot, and you can see the clustering happening with the various colors.

And here's the business intelligence from that. I'm going to create a column that's going to be populated with True or False. So column 0, that'll be my cluster one as I'm going to call it, is True wherever the cluster was equal to zero, and then I'm just going to do a value count as I group by the Trues and Falses in there. So
here are the patients who fell into this 0, which is cluster one. We can see that they responded to WhatsApp, a little less to telephone, and then very little to SMS, pamphlet and long letter. Remember I had these six types: SMS is there, WhatsApp is there, pamphlets are there, telephone is there, and the long letter is there as well. So we can see what they don't respond to: I think the one that isn't listed here is email. So email was zero; there wasn't even an email on here. So we can see that in future, for this cluster of patients, we should probably send them a WhatsApp and perhaps a telephone call as well. That's what they responded to, according to this instance of running the K-means clustering.

I can also see how many there were, so there are 49 patients, and I can actually just list them as well. So I have a list of these patients now, and I can target the campaigns to them when running the campaigns in the second year.

So that's K-means clustering, very easy to do in Python. Why don't you give it a go?
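
If you'd like to try the steps without the spreadsheet, here is a minimal sketch of the same pipeline: merge, pivot, K-means with five clusters, PCA down to two dimensions, and a per-cluster count of responses by campaign type. It is not the notebook from the video. The data is randomly generated, the column names (campaign_id, type, patient, n) are assumptions, the plotting is left out, and the feature columns are selected by campaign ID rather than by position.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the two spreadsheet sheets (NOT the data from the video).
rng = np.random.default_rng(0)
kinds = ["email", "sms", "whatsapp", "pamphlet", "telephone", "long letter"]
df_campaign = pd.DataFrame({
    "campaign_id": np.arange(1, 33),
    "type": [kinds[i % len(kinds)] for i in range(32)],  # hypothetical assignment
})

# Each of the 99 patients responds to a few randomly chosen campaigns.
rows = []
for patient in range(1, 100):
    n_resp = int(rng.integers(2, 6))
    chosen = rng.choice(df_campaign["campaign_id"].to_numpy(), size=n_resp, replace=False)
    for cid in chosen:
        rows.append({"campaign_id": int(cid), "patient": patient, "n": 1})
df_response = pd.DataFrame(rows)

# Merge on the only shared column, campaign_id.
df = pd.merge(df_campaign, df_response)

# One row per patient, one column per campaign, 1 where the patient responded.
table = df.pivot_table(index="patient", columns="campaign_id", values="n", aggfunc="sum")
table = table.fillna(0)
# Make sure all 32 campaigns appear as columns even if one got no response.
table = table.reindex(columns=np.arange(1, 33), fill_value=0).reset_index()

# Feature matrix: the 32 campaign columns only.
x_cols = list(df_campaign["campaign_id"])
X = table[x_cols].to_numpy()

# Five clusters, as in the video; random_state only makes the run repeatable.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
table["cluster"] = kmeans.fit_predict(X)

# Reduce the 32 dimensions to two coordinates for a scatter plot.
xy = PCA(n_components=2).fit_transform(X)
table["x"] = xy[:, 0]
table["y"] = xy[:, 1]

# Attach each patient's cluster to the responses and count by campaign type.
final = pd.merge(df, table[["patient", "cluster"]], on="patient")
summary = final.groupby(["cluster", "type"])["n"].sum().unstack(fill_value=0)

print(summary)                                        # responses per type, per cluster
print(table["cluster"].value_counts().sort_index())   # patients per cluster
```

Because the responses here are random, the clusters won't show the clear WhatsApp-and-telephone pattern described in the video. The point is only the shape of the steps. Also remember that labels start at 0, so label 0 is what the video calls cluster one.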