Alright, I'll just go through again from the start. This is Intro to Data Science, I'm Javier. A quick disclaimer: this is intro to data science, not AI, not machine learning, nothing fancy. We'll just cover some simple stuff like plotting graphs, linear models, and classifying things. The workshop aims to give you an understanding of the data science lifecycle. We'll be using pandas to load and access our data with Python. Then we'll do some simple exploration of the data with matplotlib and seaborn, which help us plot graphs, and we'll look at how to interpret those graphs. Then we'll look at data modeling, which covers drawing best-fit lines, that's regression, and also classification. I'll go into these more later. So, data wrangling: this is where we prepare our data. Before we can analyze or plot our data, we need to load it into the code. Usually the first thing we'd do after that is clean the data, which means things like removing missing values or empty rows, but we'll skip that for this workshop because it's not essential for the understanding. Our tool of choice for this is pandas. Pandas is a Python library that helps us manage tabular data, so rows of data, efficiently, and in pandas data is stored as Series and DataFrames. A Series is like a list of data: for each person, maybe there's a Series for one of their attributes, like their hair color or their shoe size, all that kind of random stuff. And we also have DataFrames, which are tables of data. 
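To make the Series-versus-DataFrame distinction concrete, here's a minimal sketch; the people and their attributes are made-up example data, not from the workshop's dataset:

```python
import pandas as pd

# A Series is a one-dimensional labelled list of values.
shoe_sizes = pd.Series([38, 42, 40], name="shoe_size")

# A DataFrame is a table of data: each column is itself a Series,
# and every cell is identified by a row index and a column name.
people = pd.DataFrame({
    "hair_color": ["brown", "black", "blonde"],
    "shoe_size": [38, 42, 40],
})

print(people.shape)                # (3, 2): three rows, two columns
print(type(people["shoe_size"]))   # each column is a pandas Series
```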
A DataFrame is like a series of Series, in a way, where each data point is identified by an index and a column. We'll look at some actual code so it's easier to understand. I'm not sure if the link was sent in advance, but I'll show it on screen to be sure. Give me a second. Okay, so open up this IPython notebook — if you have the link, you can open it with Google Colaboratory, which is the online Python runtime we'll be using, so it's standardized and easier for everyone. I'll leave that up for a few seconds. The first thing we need to do when we're using any library, really, is to import it. It will give you a warning when you try to run it, but just click run anyway; it's nothing harmful. The next thing we do is load the data. In this case our data comes from a URL, a wine dataset, and we can do the loading with pandas.read_csv; CSV is the file type. When we run that, we get our DataFrame, like I mentioned earlier: a DataFrame is just rows and columns of data. You can look through each individual element if you want, and on the left here you can see these indexes. The index refers to the number of the wine in this case: the 0th wine, the 1st wine, and so on. We see that we have 4,898 rows, which means we have that many different wines in the dataset. On the top we can see labels like fixed acidity, citric acid, alcohol, quality, pH, all that kind of stuff; those are the columns of data. So if we were to access row 0 and then quality, we'd get 6 all the way on the other end, which tells us that the quality of the 0th wine is 6. When we want to access the data in code, we can do that with the loc accessor. loc is basically how you locate data, and when you write loc 0 you're getting the first row. 
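As a sketch of that loading-and-accessing step — using a tiny inline stand-in for the wine CSV rather than the workshop's URL, with a few made-up rows that follow the wine dataset's column names:

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the wine dataset; the real notebook loads the full
# 4,898-row CSV from a URL with the same pd.read_csv call.
csv_text = """fixed acidity,citric acid,chlorides,pH,alcohol,quality
7.0,0.36,0.045,3.00,8.8,6
6.3,0.34,0.049,3.30,9.5,6
8.1,0.40,0.050,3.26,10.1,6
"""
df = pd.read_csv(StringIO(csv_text))

# loc locates data by index label: loc[0] is the first row.
first_wine = df.loc[0]
print(first_wine["quality"])   # the quality of the 0th wine
```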
And one thing to take note of here is that loc in this case returns a Series. On the top we had a DataFrame, and now we have a Series, because it's returning just one row. You can see it has all the attributes that the rows and columns initially had, just that this one is only for the first wine. If we call type, we see that it's a Series. We can also access the columns: if we call loc with a colon, then a comma, then the name of the column we want to access, in this case chlorides, it gives us 0.045, 0.049, etc. And if we go back to the original DataFrame, you can see that in the middle that's exactly what we got: 0.045, 0.049, all that. We might need to access multiple rows, and we can do that by passing a list of the indexes. Practically, this is what you'd do when you only need specific rows and want to ignore everything else. And you can see that this time it returns a DataFrame, not a Series, because more than one row is needed; it's exactly the same as the original DataFrame, just the subsection that we wanted. And if we call loc again but pass a list with one element, we actually get a DataFrame with just one row. So this is different from calling loc with the element directly: the first time it returned a Series, the second time it returned a DataFrame. We can do the same for columns, so if we wanted the chlorides and pH columns, we would call loc with a colon, a comma, and the list of names, and that gives us the same DataFrame but only with chlorides and pH. We can actually do both at the same time: if we only want certain rows and certain columns, we can just put both of them together and access just rows 0 to 199 and just the chlorides and pH. 
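The different return types of loc can be sketched like this, again on a small made-up DataFrame standing in for the wine data:

```python
import pandas as pd

df = pd.DataFrame({
    "chlorides": [0.045, 0.049, 0.050, 0.058],
    "pH": [3.00, 3.30, 3.26, 3.19],
    "quality": [6, 6, 6, 5],
})

row = df.loc[0]              # single label     -> Series (one row)
rows = df.loc[[0, 2]]        # list of labels   -> DataFrame (two rows)
one_row_df = df.loc[[0]]     # one-element list -> DataFrame with ONE row
cols = df.loc[:, ["chlorides", "pH"]]    # every row, two columns
both = df.loc[0:2, ["chlorides", "pH"]]  # note: loc slices are inclusive

print(type(row).__name__, type(rows).__name__, type(one_row_df).__name__)
```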
We also have iloc, which is similar to loc except that it uses numerical positions instead of names. If you remember, chlorides was at position 0, 1, 2, 3, 4 — it was the fourth column, counting from zero. So if we only knew it was the fourth column, we could call iloc with that position instead of giving it the name chlorides. And once again we can access rows and columns at the same time with iloc, just as we did with loc. A nice little shorthand we have is that when we just want one column, we can skip loc and iloc entirely and just write df with the column that we want. Similarly, we can pass a list of columns that we might want, and that gives us exactly those. I just want to reiterate the reason we're looking at these functions: when we're given any real-life dataset, we're typically not going to use everything in it; there will be things we want to filter out, stuff that's not needed or not relevant to us. These functions help us remove and filter out things we don't need or don't want to see. So just a short exercise for you: filter the DataFrame so that it only contains even-numbered rows. Earlier we had a way to filter to rows 0 to 199; one thing we might want to do is access just the even rows. You can try it now. Okay, I've left it for a bit, so now I'll just go through two possible ways to do this. One possible way: if you remember the list-based access, we just make a list of all the indexes we need, in this case just the even-numbered ones, then we pass that to the iloc function, which will give us exactly what we want. 
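That first solution — building the list of even indexes and handing it to iloc — might look like this; the six-row DataFrame is a made-up stand-in:

```python
import pandas as pd

df = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1, 9.9, 9.6, 11.0]})

# Solution 1: explicitly list every even index, then pass it to iloc.
even_indexes = [i for i in range(0, len(df), 2)]   # [0, 2, 4]
evens = df.iloc[even_indexes]

print(list(evens.index))   # only the even-numbered rows remain
```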
You can see it now only has 0, 2, 4 — all the even-numbered rows. Another possible solution, if you're familiar with Python slice indexing, is just saying get everything but step by 2, so it goes 2 at a time, and it gets us the exact same thing. So I'll just pause for any questions, if you guys have any. Just as a quick recap of everything: this is a very simple way for us to filter the data, but it helps. Suppose in our dataset we only want to look at alcohol and quality: this is how we would do it. Depending on whatever purpose we have for the data, we'll do different manipulations to it. I'm not seeing any questions in chat, so I'll move on to the next section. Okay, so that was data wrangling, which helps us filter the data and really focus on what we want. Now we can look at data exploration, which is the next step. Before we can model or analyze the data, we want to understand our data. By making useful plots and statistics and just really playing around with the data, we can get a lot of information about it: we can uncover patterns in the data, anomalies like missing values, and points of interest. This gives us a deep understanding of our data, and it's very useful when we're modeling the data, which is the next step after this. 
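The slice-based solution, plus the alcohol-and-quality filtering just mentioned, as a quick sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1, 9.9, 9.6, 11.0],
    "quality": [6, 6, 6, 5, 7, 8],
    "pH":      [3.0, 3.3, 3.3, 3.2, 3.1, 3.0],
})

# Solution 2: Python slice notation -- start to end, stepping 2 at a time.
evens = df.iloc[::2]

# Filtering down to just the columns we care about:
subset = df[["alcohol", "quality"]]

print(list(evens.index))     # [0, 2, 4]
print(list(subset.columns))  # ['alcohol', 'quality']
```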
Okay, so stuff like extrapolating data points, that's more under modeling; specifically that would be regression, which we'll cover later. One thing we can do is call .describe on the DataFrame, which gives us a few statistics of the numerical data. For categorical data it wouldn't make much sense — like the standard deviation of categories — but for numerical data we get this nice summary. The first thing we can see is that it returns another DataFrame. count is just how many elements there are. The next thing we see is the mean, which is just the average of the entire column. std is the standard deviation, which is how spread out our data is: for one particular value we can calculate how much it deviates from the mean, and the standard deviation tells us, roughly, how much the entire dataset deviates from the mean on average. min is the minimum value that occurs, so 6.3 is the smallest fixed acidity we'll ever see in that entire column. Then we have the 25th, 50th, and 75th percentiles. The 25th percentile is the value such that 25 percent of all other values are smaller than it; the 50th percentile is similar, the value such that 50 percent of the data is smaller than it, and the same for the 75th. The 50th percentile is also called the median, so another way of thinking of it is as another kind of average, in a way. At the end we have max, which is just the maximum value. So that was for numerical data; for categorical data we'll call things like value counts, which I'll go through later. What we can also do is plot the data and draw graphs. Things like the histogram, the KDE, and the box plot help us look at single variables of data. We also 
have multi-variable plots, which look at two or more variables at the same time and show us any interactions they might have. We might have a scatter plot, which just says: for each pair of values, we put a dot where the coordinate is. Again we have the KDE, the kernel density estimate — I'll go through what that means later — and then we also have a categorical box plot. So I mentioned the histogram and KDE: the histogram is the bar chart you're looking at on the left. Basically it's binned; it's just saying how frequent each range of data is. You can see the 150-to-170 range there: that range of data has the most, it's the most frequent, the most common. The smoothened one is the KDE: it's similar to a histogram but plotted smoothly, and it's called an estimate because it's estimating the underlying curve. These two just give us an idea of what our data looks like as a trend: we might see that a few values are very common, like in the middle, and a few values are not as common, like at the ends. We also have the box plot, which is just a quick summary: if you remember the .describe function, the box plot is similar to that, but it gives us a graphical representation. We can see the upper extreme, which would be the maximum value; the upper quartile, which is the 75th percentile; the median, which is the 50th percentile; and so on. The whisker is just drawing a line to show us where the data ranges, and outliers are drawn for us too: if a point strays too far from the center of the data, it gets drawn as an outlier instead. So these plots help us understand single variables, how the individual values are spread. Okay, so in this case the outlier is calculated by checking whether 
it's more than 1.5 times this central range — the interquartile range — away from, in this case, the lower quartile (or the upper quartile on the other side); then it's said to be an outlier. So technically, yes, a point flagged this way could fail to be a true outlier, and that's the issue with a fixed formula, but in general it gives us an understanding that this data point is extremely far away from the center. Even in a normal distribution, this formula will say that about 1 percent of the data — a little under, actually — is an outlier. Okay, so again we'll look at some actual code, in this case the exploring IPython notebook. If you have the link you can go to it, and I'll be showing it on screen. Again, just run it. This first chunk of code is the same as earlier: it just loads the data into the DataFrame. Like I mentioned, we have the .describe function that gives us these useful statistics of the data: count, mean, standard deviation, all that. How you read it is that .describe returns another DataFrame from the original one; the rows are the different types of statistics and the columns are the original columns. Okay, so for categorical data, like I mentioned, we have another function, value_counts. If you look at the original DataFrame, we have this quality column, and the quality is given as whole numbers, three up to nine. It's individual numbers, not continuous, so it doesn't really make sense to ask about, say, the standard deviation of this category. Instead we can call value_counts, and we can read the result as: there are 2,198 occurrences of quality six, about 1,400 of quality five, and so on. Yes — so now we can look at some plotting of the data. We usually do this with the pandas .plot function, which returns a matplotlib 
graph in this case. Like I mentioned, the three common types we'll be looking at are the histogram, the box plot, and the KDE, which is similar to the histogram but smoother. Before we use it, we have to import it. If we just call the column again, we get back a Series, but from the Series we can call .plot and then specify the kind of plot, in this case a histogram, and that gives us this. We can see that the six-to-seven range, like we saw earlier, is the most frequent occurrence, and all the others are less than it. If we change this kind argument to say "kde", we get a smoother one; you can see the similarity between the two plots. The bottom one is just smoother, and its values sum to one because it's a density; it gives us an estimate of what the distribution looks like if it's smooth. One thing I want to touch on: this chunk of code you don't really have to understand, it just generates a picture for me. If you look at these three histograms, what I'm doing between them is keeping the data the same but changing what are called the bins. The bins are just how we group the data together. If we have a very large bin, we see just these three main spikes; a large bin removes the finer details of the data. It might be what we're looking for, but at the same time it might be misleading, so we really have to look carefully, when we plot something, at what parameters we're asking it to plot with. Even though these are all the same data, by changing this one argument we're getting different plots. And on the last plot it's a little more noisy, because we're almost not grouping it at all. So with too big a bin we actually lose a lot of the detail, and with too small a bin it can get a bit noisy, so it's harder to interpret. The KDE, the 
smoother one, has a similar problem; the parameter is called bw_method, and you can try that for yourself. And so we're back to the plots: we also have this box plot. The syntax is the same, just get the Series and call .plot with kind set to "box". We can read this as: this bottom line is the minimum value that we'll see (and at the bottom is the name of the variable), the green line is showing us the median, which is the 50th percentile, the top part of the box is the 75th percentile, and the final part here, this is the maximum — the maximum before it starts saying that points are outliers. Everything above that is what we've termed an outlier. For the residual sugar we can see that there are about four outlier points that are way higher than everything else, and that might be something we have to take into account if we're doing modeling of the data. So that was just for one individual attribute, but we also want to look at two attributes at the same time, because that's how we see the relationships and interactions in the data. We have an understanding of the individual features of the data, but we also want to know how they can interact with each other. One simple way to do it is with the scatter plot: basically, for each data point we take one attribute as the x-coordinate and another attribute as the y-coordinate. We have a function here, plt.scatter; it's no longer using the DataFrame .plot syntax, but basically we're saying that we want the x-coordinate to be the citric acid Series from the DataFrame and the y-coordinate to be the fixed acidity. If we call plt (which is pyplot) .scatter with x and y, we get this. How we read this is that this one dot is at about 1.25 on x, so a citric acid of about 1.25 and a fixed acidity of about 8, and then we just repeat this process for every single one 
of the 5,000-ish data points that we have. One issue is that when we have this many data points, the scatter plot can get a bit messy and it's harder to see. So we can use the KDE plot again, but in two dimensions this time — give it a second. You can see the shape is about the same, but what this one is telling us is that a lot of the points are concentrated here, because there are a lot of lines; they're contour lines, so as you go towards this center there are more and more points with those values, and you can see this clump of values here. That's the nice thing about the KDE plot: it makes it very clear where we have clumps of data. One more thing we can do is the pairplot function, because right now we're choosing one attribute as x and another attribute as y, and with pairplot we can automate all of it, saying for each x and each y, we plot one. So that's the pairplot function: it gives us a very big graph, and it takes quite a while to run. I'm going to zoom in, because it's a bit hard to read — sorry, I guess I can't zoom in too much; it's a little hard to read. But on the bottom we have what is plotted as the x-coordinate, and on the left we have what is plotted as the y-coordinate. I don't know if you can read that, but this one says density, for example. And you can see that along the diagonal these plots are different, because it doesn't really make sense to plot one attribute against itself; instead, what happens along the diagonal is that we plot a different kind, in this case we specify it to be the KDE. With this plot it's really hard to see, but for example, one thing we can look at is that these two attributes have some sort of correlation between each other, because you can see them going up, and if you read it off, the y-axis is the residual sugar and the x-axis is the density. So at a 
glance, we can pick out which variables have correlations with one another, and like I said, one of these pairs is the residual sugar and the density. This only really helps us find correlations between two variables; when we have three or more, we might need to do more complicated stuff, like changing the color of each data point to represent a value. With this dataset specifically, we might want to predict the quality of the wine, so let's try to target our exploration towards the quality. One way we can do this is to say: for each of the attributes, plot a box plot of the quality versus that attribute, and we just run this short bit of code. Okay, so that gives us all these plots, where each plot is a box plot with the quality on the x-axis and the attribute we're looking at on the y-axis. For most of these there isn't really much of a correlation, but if you look at the very last one, we see that alcohol has an interesting trend with the quality. So when we're doing the data modeling step, this is something we might want to remember: we might want to specifically remember that the alcohol and the quality have some sort of relationship with each other. And like I said, when we have three variables we might want to plot together, we've already used our two dimensions, so what we can do is change the color of the data points. It's a bit messy, but basically we're plotting a KDE again, between the alcohol and the total sulfur dioxide, and this final part is to change the hue: if the quality is greater than five, we color it separately from when the quality is five or less. And that gives us these two different KDEs: on average we can see that the higher-quality wines have a higher alcohol content, while both groups have similar sulfur dioxide distributions. Okay, I'm going to pause for questions if there are any in chat. Okay, nothing, I guess we can move on. Okay, so now we can actually do 
some of the modeling of the data, which is the regression and the classification. In data modeling there are basically two main types of problems, which are regression and classification. When we want to perform regression, it's like estimating a dependent variable from everything else; linear regression, for example, is just about finding a best-fit line. And classification is about classifying a data point from all the variables that we can see. One subset of that is binary classification, where we just want to determine which of two classes a data point should be in. So for example, cats versus dogs would be classification, but predicting, say, an animal's foot length from its ear, that would be regression. So linear regression, like I said, is about finding the best-fit line, really. A bit of math here: let's say a data point is (x, y), and we want to find the best-fit line, which is Y = mx + c. In essence, we want to minimize the error between y and big Y, which is what is estimated, and by minimizing the error we find the best-fit line. There are a few algorithms for determining the best-fit line and minimizing this error, but when we're using the libraries it's abstracted away for us, so we don't have to deal with it; instead, we can just ask the library to help us find the line. So here we can see a short clip of an algorithm finding the best-fit line, and after a few iterations it does converge on that one line. When we have more than one independent variable, usually we just have different coefficients for the different attributes. In data modeling we also have binary classification, and logistic regression specifically is a way to perform binary classification — the name is a little confusing. So like I said earlier, binary classification is classifying between two different classes. The way we can do this is to assign zero and one as our two classes, and then we take 
the linear regression output and put it into what's called a sigmoid function. Basically, the sigmoid function helps us make sure that the output of our prediction is between zero and one, and once again we minimize the error between the predicted and actual values, which helps us predict between zero and one — the two classes that we assigned. Okay, we'll go look at code again: let's have a look at the modeling IPython notebook. It has some material on how we can do linear regression and logistic regression. Okay, so we run this first part, which is just importing all the libraries we've looked at so far and getting the data from the same link as always. The first thing we'll look at is linear regression. In the earlier step of data exploration, we already saw that residual sugar and density had a kind of linear relationship, and they had quite a strong correlation, so we can try to put this into numbers with sklearn. The first thing we do is import the model, which in this case is LinearRegression; basically, this class does all the fitting of the best-fit line for us, so we don't have to do it ourselves and can just focus on what we want to do, which is analyzing the data. So the first thing we do is extract the x and y: from the residual sugar we want to predict the density, so we set the first one as x and the second one as y. The next thing we have to do is initialize the linear regression object; we can do that with just LinearRegression and brackets. Next we can call the .fit function. The .fit function abstracts away all of the math that we looked at earlier, so all we have to do is just call .fit; it will run for a bit and then return a nicely fitted line. On the bottom we can do some visualization. So after the .fit function was called, it fit the linear regression to these two columns which we picked out, from the residual sugar onto the density, and 
you can see the line that it fitted. With the regression we can actually get the exact numbers that were found, with .coef_. In this case .coef_ is actually a NumPy array, which is just a list of numbers, and it's a list of numbers because this works for multiple variables too. In this case we only have one variable, so we just access the first coefficient, which is what's multiplied by x, and then we add the intercept, which is found with .intercept_. We can also find the score of this fitting with the .score function. .score just helps us figure out how good a fit this line is, and in this case it's about 0.703; the maximum score we can get is one, so this is quite good, I would say. A score of zero means that it's only as good as predicting the mean value for everything: if you just took the average, and that was what you predicted every time, that would give you a score of zero. This is called the R-squared metric; you can read up on it if you want. So we just looked at linear regression, but we can also do logistic regression, which — recap — is not really regression, it's more of binary classification. Again, from sklearn.linear_model we import LogisticRegression. In this case we want to predict between two different classes, and the two classes we can choose are whether the quality is at least six or less than six; that will be our target variable that we want to predict. And since from the data exploration stage we knew that alcohol seems to have a strong relationship with it, we choose that to be our variable to look at. So again we initialize the logistic regression and we call .fit again, and notice that the syntax is basically the same. That's the nice thing about sklearn: much of the syntax is the same, so we don't have to focus on the math or on remembering all these libraries; we can just remember which algorithm we want to use in which case, and use it. So again, after it's fit, we can do some 
visualization, and this is the prediction threshold that it's giving us: for an alcohol content of less than about 9.5 it will say that the quality is less than six, and for an alcohol content greater than that it will say that the quality is at least six. So it's predicting between the two classes, which are less than six and at least six. The reason it's making this decision, in a way, is that if we look at the box plots of the two groups, we can see there is a trend here: a wine with quality below six usually has a lower alcohol content than a wine with quality of six or more. And again we can print the score, and the score here is 0.69; again the maximum is one, so this is an okay fit. Oh sorry, there was a question in chat: "In the residual sugar there is one value at almost x equals 70, which is very far away from the rest. How can you determine if it's an outlier or if it should be left in the dataset?" So I see what you mean — this data point up here is, I think, what you're referring to, and yeah, you're right, it is very far away from everything else. This is where — if we had seen this only now, then our exploration stage was insufficient, I guess; but if we were to go back to our exploration stage, usually we would have seen this data point and then looked at everything else about it. So we would try to identify: is it just this one attribute that is an outlier, or is it that the entire data point, so the entire wine, is an outlier? If it's the second case, then we would usually just remove it from the dataset and say that this is an outlier. Or, if it just happens that this one wine has a really high residual sugar but everything else is more or less normal, we might want to investigate more, or just keep it in the dataset. So really, having domain knowledge, as it's called, which 
is an understanding of the data in context, is really helpful for determining outliers, and when we're making models and interpreting the data. "So okay, so outliers are basically determined by common sense and domain knowledge, there is no formula or something? Thanks." I guess there is that one formula from the box plot, which just says that if a point is more than 1.5 times the interquartile range away from the quartiles, then we call it an outlier; that is a fixed formula of sorts to tell us if it's an outlier or not. But when we're really trying to determine whether it's an outlier, we need to understand the data point, we need to understand what it represents, so that isn't really a fixed formula, yeah. Okay, I'll check if there are any more questions. Okay, if there's nothing, we can move on. So now we can do some DIY. This part is really for you to consolidate everything I've gone through in the past hour or so. We're going to look at a new, unseen dataset, specifically the iris dataset: it's a dataset of different types of flowers and measurements like the petal length, the sepal length, and the widths. There are just four feature attributes here, so it's not going to be too difficult, but I want you to perform the whole lifecycle, from loading the data, to plotting the data, and then, at the end, classifying between the different types of flowers. So if you haven't already, there is the iris IPython notebook; that is what you want to look at. It's a fill-in-the-blank kind of thing, so it's not going to be too demanding: basically, just try to recall (and read documentation if you want) how to plot the specific plots and what observations you can make from those plots, and then at the end, can you get a classifier working between any two of the classes? So in this case our dataset 
I'll just run this here. Our data set has three types of flowers: the setosa, the virginica, and the versicolor, with 50 of each, giving 150 rows. There are five columns: sepal length, sepal width, petal length, petal width, and finally the type. So try to plot some plots between these different attributes. There are some suggestions down here in fill-in-the-blank style; you can obviously do more if you want. Try to write down what observations you made, and at the end get a classifier working between just any two classes of flowers. Just two, because we only went through logistic regression, and that is only for binary classification, i.e. deciding between two classes. Okay, I'll leave you to that. I'll come back and be answering questions for a while, but until then, just work on this notebook. Sorry if I didn't mention this earlier, but we're going to be doing this for about half an hour more, so if you have any questions, feel free to put them in chat too.

So there is a question in chat: for the pair plot (number six), how do you set the hue as the flower type? I'll just quickly load up the notebook, give me a second. Okay, so for the pair plot, the first argument you supply is just the data frame, and then the diagonal kind. Remember the diagonal plot type? In the example we had, we used the KDE, so I'll just put "kde" again. Finally, the hue: you specify it by just typing the column name from the data frame that you want. You could also pass the series with the data you want, but since you already passed in the data frame, you can just supply the column name. (We'll be going through the remaining questions in a few minutes.) So that's how you get that. If you don't specify the hue, you just get a single colour; with the hue, the plot separates the points into different colours depending on the flower type.
Okay, I don't know if everyone's back yet, but we're going to go through the exercise. So this was the original notebook that I asked you guys to work on (well, except for this cell, but never mind). In the drive, I'm not sure if you can see it yet, but there should be a notebook called iris_solution or iris_sol. If you open that, that is the solution I wrote, and we're going to look at it now.

The first thing we do is run it. This is our original data frame that we got from the URL, and the first thing we wanted to do was just plot the sepal length versus the petal length, a logical thing to do. For this we can just call the plt.scatter function and pass it both columns. Seaborn also has a scatter plot function, I don't know if I mentioned that, but we can look at that later. The first two plots are more or less the same, just with different attributes. With seaborn we can actually set the hue, but before that we want to look at the box plot: sns.boxplot with x as the sepal length and y as the type. You get this plot, and you can actually see this one outlier here, which is interesting. If we go back to the scatter plot with the flower type as the hue, we can use seaborn for that: sns.scatterplot, a slightly different name. We set the sepal length as the x, the sepal width as the y, and then pass in the hue parameter, which we give the type. So each type gets a colour of its own, and that allows us to see the general pattern between the two. Just one more: we also plot the petal length against the sepal length, with the hue as the type again. And then we also have the pair plot.
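Put side by side, the three plot calls just described might look like this sketch. Again the data is a made-up stand-in and the column names are assumptions, not the real iris rows:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small stand-in data (made-up values, assumed column names).
df = pd.DataFrame({
    "sepal_length": [5.0, 5.1, 4.9, 6.4, 6.0, 6.3, 7.2, 6.9, 7.1],
    "sepal_width":  [3.4, 3.5, 3.1, 2.9, 2.7, 2.8, 3.0, 3.1, 3.2],
    "type": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# 1. Plain matplotlib scatter: no grouping, every point the same colour.
plt.figure()
plt.scatter(df["sepal_length"], df["sepal_width"])

# 2. Seaborn box plot: one box per flower type, outliers drawn as points.
plt.figure()
sns.boxplot(data=df, x="sepal_length", y="type")

# 3. Seaborn scatter plot: the hue= parameter colours points by type,
#    which plain plt.scatter does not offer directly.
plt.figure()
ax = sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="type")
```

The `hue=` keyword is the main convenience seaborn adds over the plain matplotlib call here.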
The pair plot is a lot faster to run this time because the data is smaller and there are fewer attributes. Again, we can use the hue here; I like to use it because it helps us see the different clusters and patterns in the attribute we're trying to predict. For the observations, you could write things like "sepal width and sepal length are correlated". And if you actually look at the petal width and the petal length, they seem to be more or less linearly related. These observations are just nice to list down so that, when we're doing any modeling, we have a reference to all the insight we gathered from this stage.

The final part was to get a classifier working between two classes. This line is actually boolean indexing, I'm not sure if I went through it, but it basically just filters out all the Iris setosas. Setosa is a really distinct class, basically a class of its own, so it's really easy to distinguish, and for a bit of a challenge it's nicer to do something harder: if we try to distinguish between the versicolor and the virginica, we're going to have a slightly harder time, because you can see they are not so distinct. So we set the X and y as the different attributes and we fit a logistic regression. In this case I picked the petal length as the single attribute used to predict the class, because if you look at the different plots, you can see that the petal length has a very nice line right about here which cuts the classes right in two, so it's a really good attribute, even on its own, for classification. If we run the logistic regression and the fitting, we get a score out, and in this case it's 0.93, which is really good. And there's just one more visualization, where I feed a bunch of inputs into the logistic regression.
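The filtering, fitting, and scoring steps just described might look roughly like this. The rows and column names below are made up for illustration (the real notebook uses the full data set, where the score came out around 0.93):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative rows for the three classes; values are made up, not real iris data.
df = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.0, 4.2, 4.5, 4.7, 5.1, 5.4, 5.6, 6.0],
    "type": ["setosa"] * 2 + ["versicolor"] * 4 + ["virginica"] * 4,
})

# Boolean indexing: the comparison yields a True/False series, and df[...]
# keeps only the rows where it is True, dropping the setosas.
two_classes = df[df["type"] != "setosa"]

X = two_classes[["petal_length"]]  # double brackets: sklearn wants 2-D features
y = two_classes["type"]

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # fraction of training rows classified correctly

# Feeding a grid of petal lengths to the fitted model shows where the
# prediction flips from one class to the other, i.e. the decision boundary.
grid = pd.DataFrame({"petal_length": np.linspace(3.5, 6.5, 121)})
preds = clf.predict(grid)
flip = np.argmax(preds == "virginica")  # first index predicted as virginica
boundary = grid["petal_length"].iloc[flip]
print(f"prediction flips to virginica around petal length {boundary:.2f}")
```

Note that `score` here is measured on the training rows themselves, which is what the notebook does too; for a serious model you would hold out a test set.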
If we input a range of petal lengths, we can see its decision boundary: below about 4.9 it says versicolor, and above about 4.9 it says virginica. That is what the model is doing to distinguish between the two classes, and if you look back at the original plots, you can see that really does seem to be how they split. There's also another exercise, a linear regression, which you can do for fun if you want: because we said the petal length and width were quite correlated, you can do a simple linear regression to find what that relationship actually is. And just one more function, sns.regplot, which is a regression plot. If you want to plot the data and a linear regression fit at the same time, you can use this function. It doesn't give you the coefficients, but it's a nice tool for plotting quickly, and you can see it has a really high score, which means the two are quite well correlated, linearly at least.

So those were my solutions. If you have any questions, you can ask them in chat. If there are no questions, then this is basically the end of the workshop. I really hope you guys enjoyed these two hours, where we went through the whole, well, most of the data science lifecycle: from data wrangling, to exploring, up to this point, which is predicting between the different flower types. Hope you guys enjoyed it. There is a feedback form; I'm not sure where it is, I'll send it out later, but thanks for coming. I guess there's one more thing, which I have to say: if you're looking to upgrade your skills for career progression or transition but are not sure where to start, you can speak with a SkillsFuture Singapore skills ambassador to identify your needs and gaps and gain useful tips and advice on how to kick-start your skills search. There's a QR code you can scan if you want to learn more. The
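To round things off, here is roughly what the sns.regplot call from the solution walkthrough looks like. The petal measurements below are made-up stand-ins for the real iris columns:

```python
import numpy as np
import seaborn as sns

# Made-up petal measurements standing in for the real iris columns.
petal_length = np.array([1.4, 1.6, 3.9, 4.4, 4.8, 5.1, 5.6, 6.1])
petal_width = np.array([0.2, 0.3, 1.2, 1.4, 1.6, 1.9, 2.2, 2.4])

# regplot scatters the data and overlays a fitted regression line in one call.
# It does not report the slope/intercept; for those, fit separately, e.g.:
ax = sns.regplot(x=petal_length, y=petal_width)
slope, intercept = np.polyfit(petal_length, petal_width, deg=1)
print(f"width is approximately {slope:.2f} * length + {intercept:.2f}")
```

So regplot is the quick visual check, and a one-line `np.polyfit` (or sklearn's LinearRegression) recovers the actual coefficients of the relationship.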
feedback form is here in the QR code too. And if you would like to find out more about the wide range of training courses and resources, you can access the SkillsFuture portal, which is also in the QR code. So yeah, thanks. That's the end of the workshop; thanks for coming.