Hi there. Everyone calls me Professor Jay. I am from the Department of Biostatistics and Bioinformatics at the Milken Institute School of Public Health at the George Washington University. This is a short video tutorial on exploratory data analysis, which means summary statistics and data visualization, using the ChatGPT large language model. The tutorial is specifically for the postgraduate students in our school, especially those who are planning to take, or have taken, my postgraduate class in introduction to biostatistics, but it is open to all of our students at the school and at GWU. And as always, I extend an invitation to anyone interested in a career in public health. If you want to study at one of the leading schools of public health in the US, or indeed in the world, why don't you consider GWU, here in our nation's capital, Washington, DC? I can hardly think of a better place to start your career. I will leave information about our school in the description.

The plan in this tutorial is to use the GPT-4 model with the Code Interpreter from ChatGPT. This requires ChatGPT Plus, which is the paid subscription. In a later video I will take a look at a large language model that is available for free. For this video, though, I will start by uploading a spreadsheet to ChatGPT and then show you a variety of prompts that you can use to perform EDA using a large language model. We will take a look at summary statistics, comparative summary statistics, and data visualization. Stick around till the end, because I have included a sneak peek into how you can set things up so that you can perform statistical tests in ChatGPT.

To get us started, we'll look at the data set. This data set is available on Kaggle and in other repos on the internet. I've simplified it a bit by removing some of the columns, just to keep it easy, because all we're trying to do is some exploratory data analysis. I'm going to call it just EDA.
As mentioned, this is a simplistic data set. If you want to learn more about the diagnosis of coronary artery disease, which you'll see is our last column here (whether someone has heart disease or not), please have a look at the link down below, where you can read up on the proper diagnosis of coronary artery disease. We're just going to use this data as it is, quite a simplistic view, because we just want to do some simple EDA and have a look at how to do it using an LLM.

So, a couple of things about this data set. First of all, I've noted there for you that this is in tidy data format, which means we have very well defined what a row and what a column is. Each separate row is a single observation, so a single participant in the study, and for each of these variables we'll have a value for a very specifically defined variable. So each row is a separate observation, and each column is a separate and individual variable. We see the names at the top, the column headers: age, then binary sex, cholesterol, resting ECG, max heart rate, exercise angina, and heart disease. Each of these columns holds a very specific variable with a specific data type and a defined sample space. If we look at exercise angina, there's only N and Y; those are the only values that are possible. If we look down the age column, it's only integer values, so we won't find anything like a string, words, or sentences in that column. For the next one, binary sex, there's only M and F, for male and female. There are no numbers in there. So each column has a very well-defined sample space, a very well-defined variable, and a very well-defined level of measurement. The age column, for instance, would be continuous numerical, whereas binary sex would be a nominal categorical variable. These things are very well defined, and that defines for us this idea of tidy data.
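If you would like to check the "defined sample space" idea yourself, here is a minimal sketch in Python using only the standard library. The column spellings (`BinarySex`, `RestingECG`, and so on) and the four data rows are assumptions made up for illustration; they are not copied from the real file, so adjust them to match your own spreadsheet.

```python
import csv
import io

# Made-up rows for illustration only; the column spellings are assumed,
# not taken from the actual data file used in the tutorial.
SAMPLE_CSV = """Age,BinarySex,Cholesterol,RestingECG,MaxHR,ExerciseAngina,HeartDisease
40,M,289,Normal,172,N,0
49,F,180,Normal,156,N,1
37,M,283,ST,98,N,0
58,M,164,LVH,99,Y,1
"""

# Sample space of each categorical column, as described in the tutorial.
CATEGORICAL = {
    "BinarySex": {"M", "F"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"N", "Y"},
    "HeartDisease": {"0", "1"},
}
# Columns that should contain only whole numbers.
NUMERICAL = ["Age", "Cholesterol", "MaxHR"]


def check_tidy(rows):
    """Return (row number, column, value) for every value outside its sample space."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for col in NUMERICAL:
            if not row[col].isdigit():
                problems.append((i, col, row[col]))
        for col, allowed in CATEGORICAL.items():
            if row[col] not in allowed:
                problems.append((i, col, row[col]))
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = check_tidy(rows)
print(problems)  # an empty list means every value is inside its sample space
```

An empty result is what you hope for. A word such as "male" typed into the binary sex column, for example, would show up in the list with its row number.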
The next thing that we'll notice, if we look at these column headers, cholesterol, resting ECG, is that there are no illegal characters. The most common illegal character would be a space. Now, it might look nicer if we had resting, space, ECG, and certainly if we want to create plots we want that space in between. But if you've got any kind of experience with Python, R, or any other programming language, you'll know that for this kind of data analysis it's much better if there are no illegal characters inside these column headers. So that's just something to be aware of. But as I mentioned, if we do want to create plots, we'll have to fix the fact that just max HR looks a bit ugly, when you want it to say maximum heart rate.

So let's have a look at the seven variables that we have here. Our first column is an age variable with the name age. As I mentioned, there are only natural numbers in there, but as far as level of measurement is concerned it defines for us a continuous numerical variable. Next is binary sex, so a simplistic view of gender there, and inasmuch as it's a binary variable, there are only two unique values in our sample space: F for female or M for male. So just a very simple data set. Next, cholesterol: that's serum total cholesterol, measured in milligrams per deciliter. You can see the values there; once again they're natural numbers, but that does define for us a continuous numerical variable. The next one is resting ECG; that's a multilevel nominal categorical variable.
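One way to keep headers code-friendly and still get readable plot labels is to store the labels separately. The sketch below is only an illustration: the header spellings and the label wording are assumptions, and `is_code_friendly` and `display_label` are hypothetical helper names, not part of any library.

```python
import re

# Assumed header spellings (hypothetical; check them against your own file).
COLUMNS = ["Age", "BinarySex", "Cholesterol", "RestingECG",
           "MaxHR", "ExerciseAngina", "HeartDisease"]

# Human-readable labels for plot titles and axes; wording chosen for illustration.
LABELS = {
    "Age": "Age (years)",
    "BinarySex": "Binary sex",
    "Cholesterol": "Serum total cholesterol (mg/dL)",
    "RestingECG": "Resting ECG",
    "MaxHR": "Maximum heart rate (beats per minute)",
    "ExerciseAngina": "Exercise-induced angina",
    "HeartDisease": "Heart disease",
}


def is_code_friendly(name):
    """True if the header uses only letters, digits, and underscores, and does not start with a digit."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None


def display_label(name):
    """Return the readable label for a header, or the header itself if none is defined."""
    return LABELS.get(name, name)


# Headers containing spaces or other awkward characters would be listed here.
bad_headers = [c for c in COLUMNS if not is_code_friendly(c)]
print(bad_headers)
print(display_label("MaxHR"))
```

The design choice is simply to leave the headers alone for analysis and to look up the nicer wording only at the moment you draw a plot.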
There are three levels, in other words three classes, these being the unique values in the sample space of this variable. They are Normal, for those participants that just have a normal ECG (electrocardiogram); ST, for ST segment changes, and if you know anything about an ECG, that might indicate some ischemia of the heart, meaning there's not enough oxygen supply to the heart muscle itself; and LVH, for left ventricular hypertrophy, which is enlargement of the left ventricle. That can be assessed, at least to some extent, on an ECG, which gives the idea that there is LVH. So those are the three classes or levels for resting ECG.

Then max HR, maximum heart rate: that's the maximum heart rate achieved by each individual when they were stress tested. A stress test is exercise-induced stress, so usually walking or cycling in a lab, allowing for the measurement of the maximum heart rate, and that can indicate various problems.

The next one is exercise-induced angina. You're either going to ask the patient whether they develop angina when they are physically active, or, once again in the lab, on a treadmill or on a bicycle, see if they develop pain across the chest. That's exercise-induced angina. The sample space there is yes and no, Y for yes and N for no, so that makes it a binary categorical variable as well.

And then finally we have heart disease, which is encoded with a zero if there is no heart disease and a one if there is heart disease present. That makes it binary categorical. It is not a numerical variable; that's just encoding, so we've got to be careful there. Again, it's not that simple: we wouldn't just use those six variables to decide whether someone has heart disease, so please have a look at the link that I've put down below if you want to learn more about the proper diagnosis of heart disease. But this is our data set, so we're
going to use an LLM, and we're going to do some EDA, exploratory data analysis. So here we go. We've chosen ChatGPT for this demonstration, and that is at chat.openai.com. Of course, if you want to sign up there's a whole sign-up process; they are going to ask for your telephone number, etc. As you can see here, this is ChatGPT Plus. The Plus makes the GPT-4 model available to us, with all the plugins and Code Interpreter, but that does cost money: it costs $20 a month. If time allows, I will make a video where we'll look at some of the free options, like claude.ai.

If I hover over GPT-4, you'll see it says there that it is available exclusively to Plus users, and that GPT-4 currently has a cap of 50 messages every three hours, even though we are paying for it. Then there's the default setting and the Code Interpreter setting, which is the one we're going to use; you can see it already has a check mark, so just click on Code Interpreter. There is an abundance of plugins that you can use as well, but for this demonstration of EDA we're going to choose Code Interpreter, and that is already selected. Right at the bottom we can also see that this is the August 3 version of the GPT-4 model.

We can see our little box where we're going to interact with the large language model. You'll see the little plus symbol there; if you hover over that it says upload file, and that is where we're going to upload our data set. In this instance, for the GPT-4 model with Code Interpreter, you have to have the CSV file, the spreadsheet file, on your local system so that you can upload it.

One thing I didn't mention when we had a look at the data set itself is that it was saved as a CSV file. From any spreadsheet software you can save in the proprietary format of that software, or you can choose to save as a CSV file. That stands for comma-separated values, and what that does is strip away all the
extraneous niceties that we like to add to our spreadsheet files, such as putting dates in a certain format or adding dollar symbols to financial data. It strips away all of that, so we're working with the pure values themselves. I always suggest that you export your spreadsheet files as comma-separated values, so that is the file format that we're going to use here and upload.

I'm going to click on upload, which will bring up the file explorer on my system, and I'm just going to select that file, click on it, and upload it. So there you go, it's hotlm.csv. A link to this file will be in the description so you can use it yourself. Then we can just type our message there. Once you click on upload, depending on the size of the file, you'll see a little blue spinning disk, which indicates that the file is being uploaded. As it stands, the file is already uploaded.

Now we could just hit this little play button here, sending our message, so that the LLM can interpret this file, and it'll do a pretty good job. It really looks at the data type for each of our columns. It's very nice if your data is in long format, tidy data format: that makes it very easy for Code Interpreter to analyze your data, and your data should be in that format anyway for proper analysis. But I like to do something else. I like to put a short little message there telling GPT-4 about my data set.

So there we go. I'm going to copy and paste a set of prompts that I have pre-written, and we'll go through each of those. I'm only doing that to save time, so you don't have to watch me type. As I mentioned, you can just hit send message without typing any of this once your CSV file, your data file, is uploaded, and as I said, ChatGPT will do a pretty good job of interpreting what your variables mean. But I do like to write a proper prompt and give it as much
information as I can. I would normally give even more information than this, but this is just for demonstration purposes. So let's go ahead. I've said here that the CSV file contains seven columns, and then I talk a little bit about each of these. Age describes the age of participants in years. Binary sex describes the gender of each participant, with two classes: M for male and F for female. Cholesterol is the serum cholesterol in milligrams per deciliter. Resting ECG is a multilevel variable with classes Normal for a normal ECG, ST for ST segment elevation or changes, and LVH for left ventricular hypertrophy. Max HR is a continuous variable measured in beats per minute, describing the maximum heart rate reached during exercise. Exercise angina is a binary variable with two classes, N for no and Y for yes, and describes whether angina was induced by exercise. And heart disease is a binary response variable with a zero for no heart disease and a one for the presence of heart disease. I would typically do a little bit more than this, but as I said, you don't need to do anything at all when you just upload this file.

So there we go, I'm just going to hit send message, and there you can see the response. It says, "Great! Let's start by loading the data from the CSV file to inspect its first few rows. This will help us get a sense of the data structure." It'll do that without us having given any prompts, and it'll do a pretty good job, but now that I've given it a little bit more, it should know a little bit more.

The other thing that you'll see is "Finished working" and "Show work", and if I drop that down you'll see the actual code and the results. So it's going to show you the actual code that was used. It's always going to be Python by default here, and you can copy and paste this code and put it in your IDE, that's an integrated development environment. I like to use Visual Studio Code, or you can use JupyterLab. There's various software that you can use to type in your Python code.
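As a rough idea of what "loading the data and inspecting its first few rows" can look like when you run it yourself, here is a small sketch. It is not the code ChatGPT produced: it uses only Python's standard library, the file name `example.csv` and the function name `head` are hypothetical, and the data rows are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows for illustration only; the column spellings are assumed.
SAMPLE = (
    "Age,BinarySex,Cholesterol,RestingECG,MaxHR,ExerciseAngina,HeartDisease\n"
    "40,M,289,Normal,172,N,0\n"
    "49,F,180,Normal,156,N,1\n"
    "37,M,283,ST,98,N,0\n"
    "58,M,164,LVH,99,Y,1\n"
)


def head(path, n=5):
    """Read a CSV file and return its header plus the first n data rows."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # zip stops after n rows without reading the rest of the file
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows


# Write the sample to a temporary file so the sketch runs anywhere,
# then read it back the way you would read your own CSV file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w", newline="") as f:
        f.write(SAMPLE)
    header, rows = head(path, n=3)

print(header)
for row in rows:
    print(row)
```

To try it on your own data, call `head` with the path to your CSV file instead of the temporary one.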
But it's going to give you the actual Python code that it used and the output of that Python code, so using a large language model is an excellent way to learn Python. This is not going to work for some of your other favorite languages for data science, and I suppose it just underscores the popularity of Python when it comes to data science.

Now it's going to tell us a little bit more about these variables. As I said, if I had put in no prompt, it would have given the same set of results that we can see here for these seven variables, but giving it that little prompt may help the large language model make more sense of the analysis later on. So that's it, our data set is now uploaded. It's very simple: I clicked on the plus, navigated behind the scenes on my computer's internal hard drive to where the CSV file was, added a little bit of a prompt, and sent the message, and the data is now uploaded.

So let's go for our next prompt. Here I have typed: "Provide summary statistics for the continuous numerical variables. Include the number of observations, the number of missing data, the mean, median, standard deviation, variance, minimum, maximum, range, quartiles, and interquartile range. Generate a table of the results." So that's my prompt. I could probably just have said "provide summary statistics" and that's it, or at least "summary statistics of the continuous numerical variables". I would not have to put in all of those specific statistics that I'm after for the sample, but it helps if you guide it this way and are specific about the summary statistics that you want.

Let's see what the LLM does. You can see here the table is being generated line by line. We get a little message at the bottom: it says it appears that there are no missing values for the continuous numerical variables in the data set, so that's brilliant. And there we see our metrics: count, mean, standard deviation, min, the first quartile, the
median (the second quartile), third quartile, maximum, variance, range, IQR, and missing. It has put them in a nice little order for us as well, and we can see age, cholesterol, and maximum heart rate. We'll see there are 918 observations for each. We can see the mean age was 53 and a half years, with a standard deviation of 9.43, or a variance, as we see at the bottom, of 88.97, which is the square of the standard deviation. So a very nice table, and we can very quickly get some idea of our data. This is EDA: we need to understand our data, and the first step in doing that is just to determine these summary statistics. And how easy was that? I typed in a couple of English sentences, I should say a little paragraph of English sentences, and there's my result.

Once again I can click on Show work, and you can see there what was done. This is the Python code for producing this result, with the output here at the bottom, and you can just click copy code. As I mentioned before, if you go to your IDE you can paste that code and run it, and it'll run. Here we see the output of that code, which was then used by ChatGPT to create this beautiful table. So very nice, there are your summary statistics, and we have a very good idea of our continuous numerical variables.

Here's my next prompt. I have typed: "Create a table of the frequency and relative frequency of the classes for the categorical variables." These are categorical variables, so we can't have a mean, quartiles, variance, etc. We want the frequency and the relative frequency, and as mentioned there, I want this in a table as well. And there you go: instead of a table we've actually got a bit of a list view here. There are our variables. For binary sex it has found M for male and F for female. We see the frequencies there, 725 male and 193 female, and the relative frequency, or proportion, is 78.98% male and 21.02% female. We'll see that for resting ECG
as well: Normal, LVH for left ventricular hypertrophy, and ST for ST segment elevation. ST segment changes would probably have been more appropriate, but look, it is drawing on the initial information that I gave about the data set to write this response for us. Then exercise angina, no and yes, and heart disease, one for the presence of heart disease and zero for no heart disease, so again using some of that information. That's why I like to give it information about my data set when I do the first upload. So now we've got a pretty good sense of our numerical variables and our categorical variables. With EDA we start to develop a good understanding of our data set.

Now we can get a little bit fancier than this. So far I've included all the values in these columns, for these variables. But what if I want my summary statistics filtered for a specific class of one of my categorical variables? Let's have a look at what I've typed here: "Create a table of the summary statistics of the numerical variables used before, but filter only for those with a normal resting ECG." Once again, just a normal English sentence. I didn't use any of the abbreviations, Normal, ST, or LVH, that were used as the names for the unique values or classes in that variable. I want a table of the summary statistics of the numerical variables used before, so it's going to give me all those statistics, but I want to filter only for those with a normal resting ECG.

Let's just go back up here. If we look at our results for resting ECG, Normal, have a look at that frequency: there were 552 observations that had a normal resting ECG. So let's see what happens with this prompt. And look at that, it says that, similar to the entire data set, there are no missing values for the continuous numerical variables in this filtered subset. If we
look at the count, we'll see 552, so that was correct: there were only 552 observations that had a normal ECG. Now I get the summary statistics of my numerical variables, age, cholesterol, and maximum heart rate, but only for those 552 participants; I'm excluding all the others. So you can start to see how specific you can get by filtering some of your data and then redoing your summary statistics, so you can get a really good sense of what is going on with your data. I really can't emphasize this strongly enough: if you want to start to understand the information, the knowledge, the story that's hidden in your data, summary statistics are the proper way to start looking at your data.

Up till now we've only looked at a single variable at a time. We've included all of the values, so the whole data set, in our summary statistics, both for our continuous numerical and for our categorical variables, and we've seen how to select or filter only a subset of our participants. That's great if we just look at a single variable at a time, which is called univariate summary statistics. But we can also do comparative summary statistics, where we compare two or more groups to each other. Remember that one way to form these groups is to group by the classes of one of our nominal categorical variables.

So what I've typed here is: "Use the same summary statistics for the continuous numerical variables as before, but only for the age column", so I'm asking for all those statistics but only for the age column, "and group the results by the classes of the response variable." Remember, I told it right at the beginning that heart disease is the response, or target, or dependent variable. "Use No heart disease for zero and Heart disease", now let's just change that, that should be a one, "Heart disease for one, as row values for the two classes of the heart disease variable." So let's have a look at what the large language model
does with that. Remember what I'm trying to do: I'm trying to create summary statistics of my age column, but I'm doing comparative summary statistics. I'm going to divide all the values into two groups, and those two groups are the two classes that it finds in the heart disease variable. So here we go, here are the summary statistics. As I said, I wanted two rows, one for zero, no heart disease, and one for one, heart disease, and that's exactly what it's done. Instead of putting a zero and a one there, it has actually put the words "No heart disease" and "Heart disease". There were 410 with no heart disease and 508 with heart disease. As far as the age column is concerned, we see the mean, and now we can see that those with heart disease were slightly older: they had a mean age of 55.9, versus 50.5 for those with no heart disease. That's what we mean by comparative summary statistics. If we know that we have a bunch of research questions where we're going to make comparisons between two groups, and the two groups are the two classes of our target, or outcome, or response, or dependent variable, it's very good to start off with these comparative summary statistics.

Now, we need not do that only for our continuous numerical variables; we can also do this for our categorical variables. So I have typed: "Create a contingency table using the resting ECG and heart disease columns. Include row and column totals in the table. Also generate a table of expected values under the null hypothesis that the variables", let's put a little s there, "are not associated with each other." So this is comparative summary statistics as far as frequencies are concerned, and that's what a contingency table is going to give us. Along the rows we'll have the classes of the one variable, and across the columns we'll have the classes of the second. So this is also a form of comparative summary statistics. So here we go. Remember, there's always the Python code to
show you how the results would be calculated if you used Python code, but here we see our results. Along the rows it has put the resting ECG classes, LVH (left ventricular hypertrophy), Normal, and ST, and across the two columns, no heart disease and heart disease. If we look at those without heart disease, 267 of them had normal ECGs, 61 had ST segment changes, and 82 had left ventricular hypertrophy, and we see the different frequencies when it comes to those with heart disease. Then we can also see the row totals and the column totals. With no heart disease, once again, there were 410, and with heart disease there were 508.

Then we have the expected values under the null hypothesis, which, remember, is that these two variables, resting ECG and heart disease, are independent of each other. If you remember a bit of your probability theory, there's a very easy way that all of these expected cell counts can be calculated. The easy way to remember it is this: if we look at this first cell, where we get 83.97, that is in the LVH row and the no heart disease column. So we just multiply the 188, which is the row total, by the 410, which is the column total for this specific cell, and then divide by the grand total, 918. That gives us 83.97. But it's best to think about this in terms of independence of the two variables. In other words, the joint probability must be equal to the product of the two separate probabilities of the two things that we put into the joint probability, here LVH and no heart disease.

So here are our expected frequencies. We see that all of them are five or more, and from that we know that we meet at least one of the assumptions for the use of a chi-squared test for independence, or Pearson's chi-squared test. So there are our summary statistics, and also a very nice table of expected values.

Now, while summary statistics are great, specifically comparative summary statistics, it's even better when we visualize our data. So what have I
asked for here? "Create a histogram of the age column. Use the title", and then I've used quotation marks there, "'Distribution of participant age'. Also use the horizontal axis label 'Age in years' and the vertical axis title 'Count'. Use a light orange as the bar color. Create bins with a minimum of 20 and a maximum of 80, with a step size of 10." What I'm trying to showcase here is that I can be very specific about the styling of my plot. I can tell it what kind of title and axis labels, or axis titles, I want, and also, as I said, the color and the actual bin size.

What I love about this response is that there was actually a mistake in the code. You can see the error there in red: using a string of single-character colors as a color sequence is not supported. So it actually executes this Python code, finds an error, and then tries to correct itself, and you can see the actual code where the correction was made so that we get the proper plot. And there's our plot. I've got the title that I asked for, I've got the axis labels, or axis titles, that I asked for, and the color, which was orange. We can see the bins: from 20 to 80, in steps of 10. So I can be very specific about this, and that's going to change the code for me. Look, for instance, there it says bins=range(20, 81, 10), and that very specifically defines these bins. So don't just leave it to the default; I can be very specific, and it's going to generate that Python code for me.

But remember, this is using a large language model, so if you're not interested in the code, we can just forget about the code. All we want to do here is the EDA, the exploratory data analysis. We can see that age is roughly normally distributed, and we can take a lot away from that: the majority of cases are in this decade, 50 to 60, with very few people in the younger age group, so there's a little bit of a left tail, I suppose, as well. So that's great for a histogram. Let's plot some
more. "Create a scatter plot of age and max HR. Add the title 'Scatter plot of age and maximum heart rate'. Use the horizontal axis title 'Age in years' and the vertical axis title 'Serum cholesterol in milligrams per deciliter'. Group the markers by the classes in the heart disease column." So I not only want a scatter plot, I want two different scatter plots on top of each other. I want that separation, that grouping by the two classes in the heart disease column, those without and those with heart disease, and I want to see them separately. "Add a legend named 'Heart disease' with the class names 'No heart disease' for zero and 'Heart disease' for one, and also insert grid lines."

So I'm very specific about the plot that I want. You needn't be this specific; you are still going to get a plot. This is all EDA, so I'm just trying to learn about the knowledge, the story that's hidden in all the numbers of my spreadsheet. But I've asked it to be very specific, and that's what it's done. So look at that: there's a little bit of a negative correlation there. As people get older, their maximum heart rate goes down a little bit. We can see the little blue markers for no heart disease and the little red markers for those with heart disease. So once again there's a lot of information that I can get from the scatter plot. In future tutorials I'll show you that you can then calculate the correlation between these two variables, and calculate it for each of the two groups, those without and with heart disease. But this tutorial is all about just the EDA.

So we've created a nice histogram, and you can go through all of your continuous numerical variables and create this histogram. You can also create two histograms, so that you do comparative data visualization, I should say, for those with and without heart disease, for instance. So try out different prompts and really
ask this LLM to create the plots that you want. That should give you a proper idea of the power of using an LLM to do your EDA, your exploratory data analysis. It is so easy to do. You can be very specific, though you need not be as specific as I was here, and you can learn all that initial information and start to get the story out of your data through comparative summary statistics and data visualization. Really powerful.

Now, as a little sneak peek, not part of EDA, but if you want to explore a little bit further, have a look at what I've typed here: "Create separate lists for the age values for each of the two classes in the heart disease column. Determine if the data meets the assumptions for the use of an equal variance t-test to determine if there is a difference in the mean age values for each list." Think about your classes: as far as biostatistics is concerned, one of the first statistical tests that we learn about is a t-test, and that is where we compare the means of a continuous numerical variable, the same variable, between two different groups. Now, if I want to use a t-test, and remember there are various t-tests, I must make sure that my data meets the assumptions for the use of that test. So that's exactly what I've asked here. I've told ChatGPT what analysis I want to do: I want to compare the means of the age values for these two groups, people with heart disease and without. But first, let's check whether the assumptions for the use of this t-test are met. I specified an equal variance t-test here, so that means it's also going to include the assumption about whether the variances are equal.

So let's have a look at this response, and I think it's absolutely brilliant. Remember, in many postgraduate introduction to biostatistics for public health research courses, we're not necessarily going to look at some of these tests. We might just mention these assumptions, and for the examples that we use we'll usually say that the data do meet
these assumptions. But it actually gives us the important ones here. It says independence of observations. It cannot determine that from the data that we gave it, but it does say here that the data should be sampled independently from the two groups, and that this is typically an experimental design issue rather than something we can test for. So you've got to know that about your data. Then normality: each group should come from a population that follows a normal distribution, so age should be normally distributed in the population from which the sample was taken. It says this can be tested using methods like the Shapiro-Wilk test. You can read up on these tests, how good they are, how well they perform, and the criticism of them, but it's telling you that this would be a possible test that it could use. And there should also be homogeneity of variances: the variances of the two groups should be equal. This can be tested using Levene's test or Bartlett's test. So there are wonderful things that you can read about here, and even if you didn't cover this in class, you can use the LLM to learn about it.

"First, let's create separate lists for the age values for each of the two classes in the heart disease column." And once again there's your code; it'll show you in Python how to do that, if you're interested. Let's carry on with the results. For the normality test, it told us it used the Shapiro-Wilk test, and for participants without heart disease it had a p-value of 0.00483. If our level of significance was 0.05, this is less than 0.05. In other words, this suggests that the age distribution for the participants without heart disease might deviate slightly from normality. And when we look at the distribution of ages in those with heart disease, the p-value is very, very small, suggesting that the age distribution for participants with heart disease deviates from normality. So we are not
fulfilling or meeting that assumption for the use of Student's t-test, so we're in trouble here. With the homogeneity of variance, the p-value from Levene's test was also below a level of significance of 0.05, and that suggests that the variances of the two groups are not equal. So if we wanted to use a t-test, we would have to consider an unequal variance t-test, and all that information is there. Now it goes on to say: given these results, the data does not meet the assumptions for an equal variance t-test. Specifically, neither group's age distribution appears to be perfectly normal, and the variances of the two groups are not homogeneous. In light of this, it would be more appropriate to consider Welch's t-test, so that would be for the unequal variances, or better still a non-parametric test such as the Mann-Whitney U test, if we want to compare the ages of these two groups. Now once again, depending on your course, you might cover non-parametric tests, but certainly the Mann-Whitney U test would be used as a non-parametric alternative to a t-test.
As I mentioned, this is not what this little tutorial is all about; it was about EDA. But here we go: it's told us now to use the Mann-Whitney U test, so let's just ask it to do the Mann-Whitney U test: "Please perform a Mann-Whitney U test to compare the two lists of age values. The null hypothesis is that there is no difference between the two lists," and I'm just saying lists of values, "and the alternative hypothesis is that there is a difference between the two lists. Use a 5% level of significance. Write a full comment of the result." So please be creative in what you type here, but try to be as specific as possible, just to help the large language model do what you ask. And now it's written there for us: the Mann-Whitney U test returned a p-value that is basically zero. Given our level of significance of 5%, alpha equals 0.05, the p-value is significantly below the
threshold. Now the comment on the result: the Mann-Whitney U test revealed a statistically significant difference in the age distribution between the participants with and without heart disease. Specifically, that p-value is less than 5%, therefore we reject the null hypothesis and conclude that there is a difference in the age distribution between the two groups. So, wonderful. Now think of all the other things that you learned about in your postgraduate biostatistics program, lectures, or courses, and you can well imagine that you can do everything that you learned about using a large language model.
Okay, and that's it for this video. I hope you've enjoyed it. Remember, if you have not joined our school, follow the link in the description to learn more about the George Washington University. You will learn a lot here. That's a promise.
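For anyone who wants to try the sneak peek outside of ChatGPT, here is a minimal Python sketch of the same workflow: check normality with the Shapiro-Wilk test, check equal variances with Levene's test, then run Welch's t-test and the Mann-Whitney U test. This is not the code that ChatGPT generated in the video. The function name `compare_two_groups` and the two age lists are hypothetical, and the ages are synthetic values generated only so that the example runs; it assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # 5% level of significance, as in the tutorial


def compare_two_groups(group_a, group_b, alpha=ALPHA):
    """Check t-test assumptions, then compare two groups of values.

    Hypothetical helper for illustration; returns p-values and simple flags.
    """
    # Shapiro-Wilk: the null hypothesis is that the sample is from a normal distribution
    p_normal_a = stats.shapiro(group_a).pvalue
    p_normal_b = stats.shapiro(group_b).pvalue

    # Levene: the null hypothesis is that the groups have equal variances
    p_levene = stats.levene(group_a, group_b).pvalue

    # Welch's t-test (unequal variances) and the non-parametric Mann-Whitney U test
    p_welch = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue
    p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue

    return {
        "p_normal_a": float(p_normal_a),
        "p_normal_b": float(p_normal_b),
        "p_levene": float(p_levene),
        "p_welch": float(p_welch),
        "p_mwu": float(p_mwu),
        # True means we reject the null hypothesis of no difference
        "reject_null_mwu": bool(p_mwu < alpha),
    }


# Synthetic ages (NOT the Kaggle data): two made-up groups for demonstration
rng = np.random.default_rng(42)
ages_without = rng.normal(loc=50, scale=9, size=400).round()
ages_with = rng.normal(loc=56, scale=8, size=500).round()

results = compare_two_groups(ages_without, ages_with)
for name, value in results.items():
    print(f"{name}: {value}")
```

With your own data, you would replace the two synthetic arrays with the age values for each class of the heart disease column, and read each p-value against your chosen level of significance, just as in the walkthrough above.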