Hi there, my name is Jean Clapper and I'm a surgeon and research fellow at the School for Data Science and Computational Thinking at the University of Stellenbosch. One of my aims is to provide educational resources for statistics for data science and machine learning. Now this video is going to be about one of the packages in Python. If you want an easy but thorough introduction to Python, I'm going to leave a link in the description down below to one of my courses that you can take to get a certificate from my research unit.
One of the most common things we do is to create models, whether they are linear regression models or logistic regression models. And for that our data needs to be in a certain format. As far as Python is concerned, I like to use a package called statsmodels, and to get your data into statsmodels it really has to be in the correct format, and for that I use a package called Patsy. So in this video I'm going to show you how to use Patsy just to get your data into the right format for building linear models inside of statsmodels. Patsy is also great for transforming our data. Sometimes we have categorical data and we want to do one-hot encoding, or we want to take the logarithm of one of our numerical variables. I'll open a Colab notebook and I'll show you just how to use Patsy.
We've opened our Colab notebook, design matrices. This notebook, together with the data file that we're going to use, will be available on GitHub and I will leave a link in the description down below. So this notebook is going to be all about the Patsy package, and that is for building design matrices and then also describing linear models. We're going to concentrate on the design matrices. So what this is about is having a data set, importing that data set and then transforming it so that it is ready for use in other packages, specifically in statsmodels.
So think about creating a linear model or a logistic regression model. The data that you import from a spreadsheet file, we can transform so that it is ready for use inside of statsmodels, and that is something we have to do quite often. Now this is running in Google Colab, but I've left a little note for you there if you want to install this on a local machine, and also some hints to install Miniconda, create environments for your project and only install the packages that you require for that project inside of that environment. So here we are inside of Google Colab and we see the table of contents, nice and neat, on the left hand side, and of course we don't need to do any of this installation.
Let's rather have a look at the packages that we're going to import. We're going to import NumPy and we're going to use the namespace abbreviation np, and that's quite standard, as is pd for Pandas. Then we're just going to import Patsy, and from Patsy we're going to import two functions so that we can use them directly, dmatrices and dmatrix. You see the tooltip coming up there. So let's run the cell and we'll import those packages.
So the data that we're going to use, as I said, will be available on GitHub. For this demonstration it is on my internal SSD. So I'm going to run this code cell and it is going to allow me to import the file into this active session. I'm going to select choose file, that's going to open the file browser on my system, and I'm going to navigate to that CSV file and click on it. Good, that file is being imported and we can see at the bottom here, user uploaded the file data.csv, and you see the size of it in bytes there. So let's just use the read_csv function, it's a function inside of Pandas, and we're going to import this file, and if we use the head method there we'll see the first five rows by default. We see we have a categorical variable here, sss. Now let's just go through the data set.
This is a simulated data set so it does not come from real patients. So sss is skin and soft tissue infection or sepsis, and that really pertains to the postoperative period. Patients have had surgery and they can develop what the older term would call wound sepsis, but we refer to it as skin and soft tissue sepsis. We see it's a binary nominal categorical variable. We just see no, yes, no, yes there. Now the aim for this data set in other tutorials is to build models that can predict whether a patient will develop skin and soft tissue sepsis based on the following independent variables.
First is their BMI, that's the body mass index. So it gives an idea of whether they are of normal build and size, whether they are overweight or whether they are underweight. Then highest temperature. So this was really a data set simulated for patients who came for emergency surgery. And in that period between admission and getting the emergency surgery, what was the highest temperature they had? This would be in degrees Celsius here. What was the highest heart rate in the period prior to their surgery? So they came to the emergency room, were seen, and it was decided that emergency surgery was required. During that time period the temperature, the heart rate and the blood pressure will be monitored multiple times. So one of the typical things to do is just to look for the highest temperature and the highest heart rate.
Pre-op WCC, that's the preoperative white cell count. White cells are a type of cell involved in fighting infection, and we can see for these simulated patients their white cell count was indeed elevated. Preoperative CRP: C-reactive protein is a protein in your blood, the level of which goes up during certain types of inflammation and infection. Another parameter that we use is pre-op platelets. A platelet is a type of blood cell as well, and the count can increase or even decrease during severe sepsis. The grade, now that is the grade of wound that we're talking about.
So someone who goes for clean surgery, that would be encoded here as a one. Clean contaminated surgery would be two, and here we think of cutting into an uninfected area that nevertheless contains organisms, for instance the bowel. Three would then be contaminated. So we would talk here about appendicitis, where an organ is already ruptured and there are organisms lying around. And then you also get dirty wounds, which would be grade four, and that would be wounds of a traumatic nature where there is a lot of soil or foreign body in the tissue with some devitalized tissue. ASA, that is a grading system used for preoperative illness. So other illness that the patients have, with one being healthy, and the higher the number goes, the more comorbidities, the more other disease they have, and that gives an indication to the anesthetist of the risk of the anesthetic. And then we have op duration and that is just how long the procedure took in minutes. So that is just to explain this little simulated data set.
So let's look at a very simple example. We're just going to use a simple linear regression model. So we're going to have one independent continuous numerical variable and we're going to try and predict the value of another continuous numerical variable. And just to remind you what these are all about, if it is a single independent variable, we have this column of ones that will be for our intercept, and this column here, X sub one, X sub two, those are the actual values for each of our observations in the independent variable, and we have to find these parameters, beta sub zero and beta sub one. And if we do this matrix-vector multiplication on the first line here, we're going to get the predicted values, if we choose these parameter values beta sub zero and beta sub one properly. Now we're going to do that by creating design matrices, and you see the function there from Patsy, dmatrices.
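To make the column of ones and the matrix-vector multiplication concrete, here is a minimal NumPy sketch. The values of x and the two parameters are made up for illustration only; they are not taken from the data set in the video.

```python
import numpy as np

# Hypothetical values of a single independent variable (not the video's data).
x = np.array([1.0, 2.0, 3.0])

# Design matrix: a column of ones for the intercept next to the observed values.
X = np.column_stack([np.ones_like(x), x])

# Hypothetical parameters: beta_0 (intercept) and beta_1 (slope).
beta = np.array([0.5, 2.0])

# The matrix-vector product gives one predicted value per observation.
y_hat = X @ beta
print(X)
print(y_hat)
```

Each predicted value is simply beta_0 times one plus beta_1 times the observed x, which is why the column of ones has to be there if we want an intercept.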
And we're going to pass as a formula what we are after. So what we're trying to simulate here is the following. We're trying to simulate pre-op CRP given pre-op white cell count. So think of a scatter plot where we have pre-op white cell count on our X axis, and given that, we want to build a model that will predict the patient's pre-op CRP. Sometimes we use CRP, or at least changes in CRP, to make clinical judgements in surgery, but it's not always available at all laboratories. And this simple little model that we're just using here as a tutorial will try and simulate what the CRP is based on the white cell count. Unfortunately, simple models like these are not very clinically relevant, but we're using this as an example.
So we would use something that's very akin to what we would use in the R language for statistical computing, and that's a formula. So this one says pre-op CRP by pre-op white cell count. So the first one is the one we're trying to predict, the dependent variable. And then after the little tilde symbol, which you'll have to find on your keyboard because it sits in a different place on many keyboards, we list the independent variables that we want to use as predictors. So we're going to use the patient's white cell count to predict the CRP. And then that goes inside of a set of quotation marks, so it is a string. And we just have to tell this dmatrices function where this data comes from. So then we're passing the data frame object df that we assigned our imported data to. And that is going to give us two objects, so we're going to assign them to two variables. It's typical to use a lowercase y for our dependent variable, that vector. And then X is going to be this matrix that contains two columns. We only have a single independent variable, but we also have to add that column of ones so that we can have an intercept as well.
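The call described above can be sketched as follows. The column names PreOpWCC and PreOpCRP and the four rows of values are assumptions made up for this illustration, since the transcript does not show the exact column names in data.csv.

```python
import pandas as pd
from patsy import dmatrices

# Toy stand-in for the video's data frame; names and values are assumptions.
df = pd.DataFrame({
    "PreOpWCC": [12.1, 15.3, 9.8, 18.4],
    "PreOpCRP": [29.3, 29.9, 12.0, 48.5],
})

# Left of the tilde: dependent variable. Right of the tilde: predictors.
y, X = dmatrices("PreOpCRP ~ PreOpWCC", data=df)

print(type(y))                      # a Patsy DesignMatrix
print(y.shape, X.shape)             # one column for y, two columns for X
print(X.design_info.column_names)   # the intercept column comes first
print(X[:3])
```

Patsy adds the intercept column by itself, so nothing about the column of ones has to appear in the formula string.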
And for that feature matrix we usually use an uppercase X, but you can use a computer variable name that might be more suitable to what you do. So let's just look at the type of these variables, the instances of these objects that we created. So for y, we ask what the type is and we see it is a patsy.design_info.DesignMatrix. So it is a type of matrix, and it's going to be a single column, because if we look at the shape attribute of this object with y.shape, we see it's 281 rows and one column. And if we look at the first couple of those, and we're using indexing notation here, the square brackets with colon five, that means start at the beginning and take the first five rows, we see 29.3. And if we go up to our data set and we look at CRP, you see the first patient was 29.3 right there, and the second patient was 29.9. And if we go down there, we see 29.9. So that's going to be a vector of our dependent variable.
Let's do the same, let's check on the metadata for X. The type of X is also going to be a design matrix. The shape will now have the same number of observations, the same number of rows, 281, but we're going to have two columns of course. And if we look at the first five entries there, we see indeed that we have the white cell counts in the second column, but in the first column we have all ones. And that is how the linear algebra of this works. For instance, if we use ordinary least squares, we need that column of ones. Now these two objects, y and X, are ready for use inside of statsmodels if we wanted to build a simple linear regression model.
Now we can add a couple more things to the formula that we pass to the dmatrices function. And the first one I want to show you is that we might have more than one independent variable. So here would be a formula that we would use, you can see it right there. So it starts with pre-op white cell count plus pre-op platelets.
And now we'll have two independent variables trying to predict this pre-op CRP. So we just use the plus symbol and we can add any of the other numerical columns. It might be that you are in a situation where you don't want the intercept, the column of all ones. And that's very simple to do. You just tag on a minus one at the end of your formula. And I'll refer you back again if we go up here. I'm talking about this formula, which we put inside of quotation marks, building our linear model.
We can also have interaction terms. We can use the colon symbol or the multiplication symbol, that would be shift-eight on my keyboard, the asterisk there. So this is how we would do it. Pre-op CRP, and we're going to try and predict that by pre-op white cell count, pre-op platelets, and then the interaction between pre-op white cell count and pre-op platelets. And we can, as I said, also use the multiplication symbol as we've done there.
The last thing in this section, I just wanted to remind you that there's also a dmatrix function instead of dmatrices. And that's only going to give you the design matrix as far as the feature variables are concerned. In other words, our column of ones, unless we want to omit that intercept column by tagging on minus one at the end, and then the feature columns. It's still just all the pluses and we still put it inside of quotation marks. But now we have only the dmatrix function, so we're only going to assign that to a single variable X.
Now, that is to get our data into the correct format for use inside of statsmodels, for instance. But we can also transform the data before we use it inside of a model. For instance, we might want to take the natural logarithm of one of the variables. And that's very simple, we're just going to wrap our variable inside of a function. In this instance, we have the NumPy function log, which is the natural log. Perhaps it'll be more useful just to transform with log base 10.
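The formula variations from this section can be sketched in one place. As before, the column names and the toy values are assumptions for illustration. One detail worth seeing in the output: in Patsy the asterisk expands to both main effects plus their interaction, while the colon on its own gives only the interaction column.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices, dmatrix

# Toy stand-in for the video's data frame; names and values are assumptions.
df = pd.DataFrame({
    "PreOpWCC": [12.1, 15.3, 9.8, 18.4],
    "PreOpPlatelets": [250.0, 410.0, 180.0, 390.0],
    "PreOpCRP": [29.3, 29.9, 12.0, 48.5],
})

# Two independent variables joined with a plus.
y, X_two = dmatrices("PreOpCRP ~ PreOpWCC + PreOpPlatelets", data=df)

# Same model without the intercept column: tag on "- 1".
y, X_no_int = dmatrices("PreOpCRP ~ PreOpWCC + PreOpPlatelets - 1", data=df)

# Main effects plus interaction, written with the asterisk.
y, X_inter = dmatrices("PreOpCRP ~ PreOpWCC * PreOpPlatelets", data=df)

# dmatrix builds only the feature matrix, so there is no tilde and no y.
X_only = dmatrix("PreOpWCC + PreOpPlatelets", data=df)

# A NumPy function can be used right inside the formula string.
X_log = dmatrix("np.log(PreOpWCC)", data=df)

for m in (X_two, X_no_int, X_inter, X_only, X_log):
    print(m.design_info.column_names)
```

For a base 10 logarithm, np.log10 can be used in the formula in the same way as np.log.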
Irrespective of what kind of transform you want to do, you can add this function right inside of your formula. So it's still going to go inside those quotation marks, but it'll just be a normal function, in this instance a function from the NumPy library. So very simple to do.
Now we've shown interaction, but we might want to create a brand new variable that does not exist in our dataset. For instance, in this case here, we want to create a new variable which adds the patient's white cell count and the op duration, the duration of the surgery. So that's not a very natural or realistic thing to do, but this is only for demonstration purposes. The problem is that if you want to add two variables to each other, we're going to have to use the plus symbol, but that is telling Patsy that we just want to add more numerical variables to our model. So we wrap this inside of this I symbol. So we write I and then, in a set of parentheses, we add these two. So for every patient, those two values are going to be added and that is now a new variable inside of our model. So it's a very useful thing to do, but do remember to wrap that inside of I. And if we show that, now remember this was dmatrix, so we're only going to see the feature variables. There's my column of ones. Now we see this addition column and then there's the CRP column, for instance. So an example there.
We can also do statistical transformations. One example would be to standardize all the values. To standardize, you take every value, subtract the mean of that variable from it and divide by the standard deviation. So again here, we're going to use dmatrix, so we're only going to have a feature matrix, and still inside of our quotation marks we'll use the standardize function and then pre-op white cell count. The data comes from the df data frame, and let's look at the first three values now.
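Here is a small sketch of the I wrapper and of standardize. The column names and values are again assumptions for illustration. Inside I(...), the plus is ordinary arithmetic; outside it, the plus adds another term to the model.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy stand-in for the video's data frame; names and values are assumptions.
df = pd.DataFrame({
    "PreOpWCC": [12.0, 15.0, 9.0, 18.0],
    "OpDuration": [60.0, 90.0, 45.0, 120.0],
})

# Without I(...), this formula would give two separate columns.
X_two_cols = dmatrix("PreOpWCC + OpDuration", data=df)

# With I(...), the two values are added row by row into one new column.
X_sum = dmatrix("I(PreOpWCC + OpDuration)", data=df)

# standardize subtracts the mean and divides by the standard deviation.
X_std = dmatrix("standardize(PreOpWCC)", data=df)

print(X_two_cols.design_info.column_names)
print(X_sum.design_info.column_names)
print(np.asarray(X_sum)[:, 1])
print(np.asarray(X_std)[:3])
```

After standardizing, the column has a mean of zero and a standard deviation of one, which can be checked with np.mean and np.std on the second column.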
So instead of pre-op white cell count, we'll now see a standardized version of this. We can also center the data. That is where we just subtract the mean from every value, and the function for that would just be center. And here we just center all the pre-op white cell counts before we import them.
Now we can even create our own functions and change our variable. And what we're trying to do here is just to convert the minutes. The op duration was in minutes. Let's convert that to hours. I'm creating a function here. So using the def keyword there, convert_hours with a single variable x, and we're going to return x divided by 60. And we use it right inside of our formula here. So dmatrix and then pre-op white cell count plus convert_hours of op duration. So running this, we see that we have pre-op white cell count there, but then the op duration is expressed in hours now. So it is a user-defined function that we can use right inside of our formula.
We can also transform our categorical variables. And what we can do here is just to look at the grade. Remember, those were encoded with one, two, three, four. So if we look at the data type of that, of course those are going to be 64-bit integers. But that is just an encoding of a categorical variable. So let's just see all the unique values, the sample space elements. We see one, three, two, four. So it just takes them in the order in which they were seen. So there are four. And what we can do here, because it's a categorical variable, is use something such as one-hot encoding. So let's do that. And the way to do that is with the C function here. So I'm going to take grade and I'm going to pass it to this C function. And if we look at those, we see instead of just grade, we now have three columns. So remember this first column was all ones, that's our intercept. Then we have three columns representing these four grades. And then we have the pre-op white cell count in the final column.
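The three transformations from this part can be sketched together. Column names and values are assumptions for illustration; convert_hours follows the description in the video. Patsy's own name for the default coding that C produces is treatment coding, where the first level is the reference and gets no column.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy stand-in for the video's data frame; names and values are assumptions.
df = pd.DataFrame({
    "PreOpWCC": [12.0, 15.0, 9.0, 18.0, 14.0],
    "OpDuration": [60.0, 90.0, 45.0, 120.0, 30.0],
    "Grade": [1, 3, 2, 4, 1],
})

# center only subtracts the mean from every value.
X_center = dmatrix("center(PreOpWCC)", data=df)

# A user-defined function, as described in the video: minutes to hours.
def convert_hours(x):
    return x / 60

X_hours = dmatrix("PreOpWCC + convert_hours(OpDuration)", data=df)

# C(...) marks the integer-coded grade as categorical.
X_grade = dmatrix("C(Grade) + PreOpWCC", data=df)

print(np.asarray(X_hours)[:, 2])
print(X_grade.design_info.column_names)
print(np.asarray(X_grade)[0])   # first patient has grade 1: all dummies are 0
```

Patsy looks up convert_hours in the namespace where dmatrix is called, so the function has to be defined before the formula is evaluated.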
So this is one-hot encoding. You might be asking, why are there only three if there are four? Well, what happens is the first one gets thrown out. We don't need four values to represent these four sample space elements. We only require three. Because if we look at the first patient, the first patient had a grade of one. Now these columns would be for two, three and four. And if they are zero, zero, zero, it means zero for two, it's not two. Zero for three, it's not three. Zero for four, it's not four. Well, the only thing that remains is one. And of course, when we've got logistic regression models, we want to compare all the others to the base case. So in this instance, it works well for us. So you can see that you only need three values for four sample space elements.
Now the order, we might just be lucky in the order in which these values appear in our dataset from the first row down. So it's best if we just specify the order. So just to show you that first patient, the grade was indeed one. So let's specify an order. And in this instance, I might want four to be my base. So I'm just going to assign this Python list to a computer variable l and I put four, three, two, one. And I can now use the C function again, using the grade independent variable, but now I'm setting levels equals l. So if we now look at it, the columns would now represent three, two, one in that order. The first one is always removed. So remember the first patient had a grade of one. So if these represent the columns, remember that first column of ones, that's just our intercept. So this would represent three, two, one. And indeed there's a one inside of the one column. And this is one-hot encoding, so only one of those will be one, and it is in the column for grade one. So we know that first patient had a grade of one. So it's very easy to do one-hot encoding. So that is the short version of what you can do with Patsy.
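Setting the order of the levels can be sketched like this. The column name Grade, the toy values and the variable name level_order are assumptions for illustration (the video uses a shorter variable name for the list).

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy stand-in for the video's data frame; names and values are assumptions.
df = pd.DataFrame({"Grade": [1, 3, 2, 4]})

# The first entry becomes the base case, so grade four is the reference here.
level_order = [4, 3, 2, 1]

X = dmatrix("C(Grade, levels=level_order)", data=df)

print(X.design_info.column_names)   # intercept, then columns for 3, 2 and 1
print(np.asarray(X))
```

The first patient has grade one, so that row has a one in the last column. The patient with grade four, the base case, has zeros in all three grade columns.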
And it really is a great tool to use, a great package to use just to get your data, transform your data and get it into the correct format for use inside of a package such as statsmodels, where you can do your linear regression and your logistic regression.