 So up till now we've looked at ChatGPT+, which means you need a subscription. So are there any free tools out there? And indeed they are. So the one that we're going to look at in this tutorial is called clod.ai, clod.caud.ai. So it looks very much like ChatGPT. We can upload a file, a CSV file directly from our internal drive. And let's see if we can perform some exploratory data analysis. Now I'll stick around to the end because as per usual, I'm not only going to do exploratory data analysis, but I'm also going to do one statistical test just to see if clod.ai can deal with it. So before we start, here's the website of the Milken Institute School of Public Health. Visit the website if you are interested in any of the courses of programs that we have on offer in this fantastic School of Public Health. And here we are at clod.ai. Now I've already signed up. So the first time you open this page, you're going to be asked to sign up for a free account. Do so. It's quite easy to do. And when you open this page again, this is what you're going to see. Meet clod. Now this is a very nice interface, I think. Looks slightly better than ChatGPT in my opinion, but everyone would have a different opinion on that. What we do see immediately though is that there's a button there that we can link a file to. So if you have your CSV, your comma separated values file, which is a spreadsheet file, or we save your spreadsheets as CSV files, we can upload it directly from our computer, which is what I'll do right now. So I've selected the file on my internal drive. It's called hotlm.csv. There'll be a link in the description down below for you to get the same very simplified file. And of course, it is available on GitHub. So as always, as I've mentioned in all of my other tutorials using ChatGPT, is that I like to give some metadata information about the data. Now it's not strictly necessary, but I think it's always good just to provide some information. Now you can run it as is, and you should actually try that. And you'll see that Claude's actually going to generate quite a bit of information about the data. It's going to guess at what the independent variables are, what the dependent variable is specifically, and it's going to give you some information. But for our purposes here, as I say, I always like to generate this metadata and tell the large language model what the data is all about. And I'll be very verbose when I do do this. And so this, you know, I typically write a lot more than this, but what I've done here briefly, it's the same prompt as I used in the ChatGPT tutorials. It's sort of a simplified version of me giving some information about each of the variables. So we've got our file uploaded there, and we're going to give some information to Claude. Now what I've noticed before is sometimes Claude will not read the whole dataset. And there we go. Claude.ai is starting to respond. It says here are a few insights on the heart disease data. By the way, look right at the top. This is what it's going to call the chat. It says heart disease prediction model says v file analysis, and that is going to be added. It's going to be saved. So when you open Claude.ai again, you'll see a list of your previous chats. And so Sydney, this chat will be there now. So there appears to be an imbalance dataset with more instances of no heart disease, zero value than instances for heart disease. That's the one encoded with a one. The imbalance should be considered when training machine learning models. The day contains a mix of categorical and continuous variables. And so it does a very good job of giving you some insight. And you can read through Sydney read through all of this. And I think it's that's that's quite a promising start. So now I've tapped in how many rows and columns are there in this dataset. Let's just make sure and and see if Claude.ai indeed read the whole file. And it certainly seems so. It looks like there are 918 rows or observations and the seven columns. And it gives us the names of all the columns, the names of the variables. And remember up here, it gave us some idea of what it thought was the categorical variables in which were the continuous numerical variables. So, so far, so good. As I said, I've tried this file before and it only read about the first 303 observations. But now just adding that single line just telling it in my short little description. And as I said, this is one of the added benefits of giving it information about your data set that I said that they're 918 observations. So, so far, so good. Let's carry on. Now I've asked calculate summary statistics of the cholesterol column. Let's still see if it includes all of the observations. I've asked for count mean, median, variance standard deviation, minimum, maximum range, quartiles and the interquartile range. And yes, it seems as if it's found all 918 values. We can see the mean, the median, the variance standard deviation, minimum, maximum range. And we can see immediately that there's some issues. There's some values that are zero. And of course, we will have to go in there and eventually try and fix that. For now, though, let's see if we can create a histogram. So I've typed in create a histogram of the cholesterol column and look what Claude.ai has done. It's generated instead of the actual histogram. It's generated some Python code for us by default. And there we go. Now you have to know a little bit about Python to know whether this is going to work or not. Certainly it's imported. As far as I can see, some of the libraries or packages quite correctly, we are going to use mathplotlib, which is one of the many, many, many plotting packages in Python. They've also imported pandas, which is a package in Python that deals with the import, wrangling, and use of data sets such as spreadsheet files. And one of the packages, I suppose, mathplotlib and pandas, some of the packages that are the reason for the success of Python in data science. Now it uses one of the functions here, read underscore CSV, and it just references the file directly. So you will have to know where you save this Python file. And as it stands at the moment that this CSV file is in the same folder or directory as your Python file. So as it stands, it's probably not going to work. So what I'm going to do now, I'll just create a folder and I'll save a Python file in that folder. And I'll also put this hard LLM CSV file in that same folder. And we're going to see if this code actually executes and gives us a histogram. So this is my favorite IDE integrated development environment where I would write my Python code, my Julia code, my R code. And this video is not about this. I just want to see if the code executes to show you. So I've created this little folder on my internal drive LLMs for data science. And you can see I've got a Python file there. And there's my hot LLM. So my Python file and my CSV file are in the same folder. Here's my Python file. Now I'm just going to use a hashtag and 2% symbol so that I can create these cells such that these cells can just be executed one at a time. Now over on claw.ai, let's just make sure that we've copied that code and let's paste that code right here. Now I'm going to hold down shift command P because I just want to select a Python interpreter and I'm going to click on data science. Don't worry about any of this. We can make separate tutorials if you want about setting up an IDE like this. If you want to, I just want to see if the code executes that claw.ai has given us. So let's run this code. And lo and behold, we have our histogram and then we can clearly see the problems that we have with the data that there were lots of values here near zero. And so there were certainly values that were encoded with a zero if that data was not available and that's certainly something that we will have to remove from the data. We might impute those missing values or we might just ignore those observations completely. But lo and behold, let's go back to claw.ai. This code that claw.ai generated actually works. Now that's one difference between claw.ai and say for instance chatGPT. ChatGPT was actually going to give you the histogram here but what claw.ai does generate some Python code for you. Now it does give us some information at the bottom. It says the histogram shows the distribution is right skewed with most values less than 300 milligrams per deciliter but a long tail extending to higher values. The peak frequency occurs between 20 to 150 milligrams per deciliter and et cetera, et cetera. So it gives you some information but at least the code the code is there for us to generate this if you are so inclined to use Python. So what if you don't want to use Python? Maybe you are a use of R. So I've typed in here generate R code to create a histogram of the cholesterol values. And once again, it's given us some information about the results that we asked for and it's generated some code. So I'm going to copy this code and here we are back in Visual Studio Code. I've got a QMD file, a quarter document and we can type in some R code here. So I've got to tell it that I'm writing R code so this is how you would go about that. And let's just paste this code in. So it's going to load the data it seems and then it's going to create a histogram using the hist function and it's going to add some vertical lines for us. That looks interesting but as always let's just see if this code actually executes. So I'm going to click on run cell and lo and behold, let's create some space here. There we go, there's our histogram and it's actually added some vertical lines for us and it looks like in blue we have the mean and in orange we have the median. So that was quite good. Now there's obviously a few problems here the text is way too small both for the access values and the access labels the overall plot label is not big enough but if you know R code this would be very easy to fix. So what we've seen here is instead of giving us the histogram claw.ai can generate some Python code or that will be the default or if we ask for some specific code in another language and I don't know the languages that it can work with but certainly Python and R as working data science those would be your go-to languages especially Python I suppose that's why it defaults to Python but there you go, it can also generate the R code for us. So since this is a tutorial we want to make sure that this claw.ai actually works I'm just going to stress test it a little bit I'm going to ask for a specific package so I'm going to say use the seaborne package to create code for a histogram with a kernel density estimate of the age column add a title, add a horizontal axis title and a vertical axis so here I've used the word title here I've used the word label so just mixing it up a bit let's see what it comes up with and here we see the results I think that looks absolutely beautiful that's even going to tell us here at the bottom you know what the code is supposed to do so you can Sydney Pausen have a look at that I'm going to copy the code and I'm going to take it across to my Python file here I'm going to paste it and let's execute the cell and see what happens and there we go it's actually done the histogram it has generated the axis labels, axis titles it's generated the overall title I can see the grid that I asked for we can see the histogram and we can see the kernel density estimate so that works absolutely that's spot on that's a very nice plot and it's done what I've asked it to do so really if you want to do your summary statistics I think you've just got to be careful as to how to go about it so be very verbose and specific tell it something about your data set just to make sure that everything is imported and of course go back to your actual spreadsheet file and see that it is imported which vaster to import it seems to work as far as the summary statistics is concerned so please give that a go try and be more specific ask it to filter out specific observations and then calculate the summary statistics for a variable play around with it and see if that works for you just as an added bonus I thought what we could do is just to try and do a statistical test so let's type in a message and I've typed in created contingency table of observed values using the heart disease and the resting ECG columns so if you're interested in biostatistics or you can remember some of your biostatistics courses remember that we do use a contingency table such as this because we are trying to look for an association between two categorical variables so what it's done here it's got heart disease those without heart disease and with heart disease there's the two rows in this contingency column and resting ECG it's got three classes a normal ECG echocardiogram ST segment changes that's the ST and LVH is left ventricular hypertrophy and now it counts these joint frequencies so there were 327 observations out of our total of 918 that had no heart disease and a normal ECG so that's our contingency table and what it's done it's also generated the Python code for us and you can certainly copy and paste that and see if it runs so I've asked it for contingency table of expected values and it's generated the Python code for us but it's also shown us here the contingency table and now this is expected values under a null hypothesis and the null hypothesis would say that heart disease and resting ECG are independent of each other they're not associated with each other and for instance in our postgraduate biostatistics course that I teach in lecture two we deal with probability theory and I show you how do we know if two variables if they are independent and certainly that is what we use to generate this table so if you're interested consider taking a postgraduate course in biostatistics and you'll learn all about these things one thing that we have to check for here if we want to use a Pearson-Sky-Square test is that all these values or at least 80% of them should be 5 or more and we see definitely the lowest value expected value we have here is 10.49 so that's what we would expect given the number of observations if these two variables are independent of each other so that's one of the assumptions you must look for now let's see if it can actually conduct Pearson-Sky-Square test so I've asked it to perform a Sky-Square test for independence to determine if the heart disease variable is independent of the resting ECG variable use a 5% level of significance and I've just added a little bit here saying comment on the result and there we go we see a p-value very small definitely less than the alpha value that we specified and then it also gives us a critical value there now you see the Python code there so certainly copy and paste that and see if that works but it seems as if it's if it's done the proper thing and as far as a critical value for two degrees of freedom again if you take the class I'll teach you all about degrees of freedom and how it relates to this Pearson-Sky-Square test but it used the degrees of freedom there which it got back from there this function actually turns a tuple and so we get the critical value of 5.99 and of course the x-square statistic that was calculated from the Pearson-Sky-Square test would be well above that giving us the small p-value and we can reject the null hypothesis that these two categorical variables are independent of each other and let's see this suggests that the resting ECG measurement may be useful in predicting heart disease status in particular the ST and LVH readings have a different distribution for those with heart disease so it's actually you know given us some good information here let's read this first part though since the p-value is less than the critical value of 0.05 we can reject the null hypothesis and conclude that there is a statistically significant association between heart disease and resting ECG which is exactly what we asked to do as far as the Pearson-Sky-Square test or the Sky-Square test for independence or the Sky-Square test for association once again statisticians many names same thing and but it certainly seems to have done the job properly here so Claude AI this is what you would do I've added a little extra at the end which is not so many statistics and data visualization but I thought let's just give it a go and see if Claude AI can do that it seems to me that it does a good job once again as with all of these I think it's very important that that you have some experience in biostatistics that you have experience with Python or R to make sure that you know what is returned here is actually correct so there you go it's going to work now you need a little bit of knowledge of either Python or R as we could see it could generate that code for you so it's not always going to give you all the results that Chatchapiti gave you for instance creating those actual plots you'll have to run some code as well but I think there's nothing wrong with it so Claude AI seems to be a good tool why didn't you give it a go?