In the first tutorial we looked at how to use ChatGPT, that is GPT-4 on ChatGPT Plus, to do our exploratory data analysis. We uploaded a CSV spreadsheet file to ChatGPT using the code interpreter and used some normal prompts, with a little bit of prompt engineering, to get the results that we wanted. What we're going to do today is play with one of the plugins, and a notable plugin is Notable.io. Notable.io is a website all on its own. You can sign up for free, and I'll show you briefly how to do that. It gives you access to a notebook environment. It's a very special kind of notebook environment because you can share it with others, and it has some added functionality that you won't get in a normal notebook. If you haven't seen a notebook before, just look up Google Colab or Jupyter Notebooks. A notebook is a page that looks very much like a Word document, other than the fact that it runs in a web browser. With a Word document, remember, you can format your text nicely. You can add images. You can even add sounds. But with a notebook you can also add code and see the results of that code. So it's beautiful for a research document: you can write out all your thoughts and ideas and format them nicely with headers, sub-headers, sub-sub-headers, etc., and you can also put in code and see the analysis, the results of that code. That's exactly what Notable.io gives you: a notebook in the cloud.

Now, ChatGPT allows us to add plugins, so you have to go into the plugin store and search for the Notable plugin. You've got to set these two things up so that they work together. When you do that in the store, it'll ask you to do a few things while you have Notable open on another tab of your browser, and it takes a little bit of setup. 
I've done it a couple of times and every time it's been slightly different. I think that's just how the updates roll, so you'll have to see what it asks you and go with it. It's really not that difficult to set up the plugin. What it will probably ask you to do is set up two-factor authentication, because they try to keep things secure.

So why would we go through all of this? We just used ChatGPT in the previous tutorial and it gave us all the results we wanted. The problem is that, even though the chat you're having is saved in the sidebar on the left-hand side, maybe you want something more permanent. Maybe you want something that you can share with others, a document that others can work on as well. That's what Notable.io is going to give you. While you chat and the results are being created in ChatGPT, it's also going to populate the notebook. In the end you've got a notebook on Notable.io that you can use again. You can add to it by going into the notebook itself, without using ChatGPT, and you can share it with others. As I mentioned, it is a really nice environment. So we're going to open Notable.io on one tab and ChatGPT on another tab in our browser, and I'll show you how it works.

Here we are on the Notable website. Note that it is notable.io, and this is where you're going to sign up for a free account. This is our access to a notebook where the code will be generated and run for us. It is a notebook like Google Colab or Jupyter Notebooks, inasmuch as you can have images and nicely formatted text. Hopefully you have some experience with notebooks. So go ahead and sign up for that free account, and then we're going to use the Notable plugin in ChatGPT Plus, which will populate the solutions that ChatGPT gives us right inside our notebook. 
Now, I've already signed up for a free account, so I'm just going to log in. Once you've logged in, you get to your space, and this is where you'll see all your projects. I've created one here called My First Project. I'm going to click on that, and we get to a page where we can create our first notebook. That's not what we're going to do, though; we'll just stay in this project. What you're going to do is go up to the URL at the top of the browser, select it, and copy it to your clipboard, because we need to paste it into a chat session on ChatGPT.

So here we are in ChatGPT. This is a brand new chat, and I've selected GPT-4, since I am on the ChatGPT Plus service. In the little drop-down box, you'll see that I have selected the Notable plugin. There's Notable, and I've put a check mark there by clicking on the open box. We can read its description on the left-hand side: create a notebook in Python, SQL, and Markdown to explore data, visualize, and share notebooks with everyone. You'll have to search for the Notable plugin if this is the first time you're using it, and make sure that it is available in your ChatGPT session.

So there's the first part of my prompt. I said "use this project" and pasted the URL from the Notable website where my project was open. Let's go back to that page, where we can see the project is open. There are no notebooks here; if you've created notebooks, they'll be listed here. In this My First Project page of mine there are no notebooks, but it is the URL at the top that I have copied and pasted. So here we go: I'm just saying "use this project" followed by the URL. I've submitted this as a message, and the response is that it understood: it has set My First Project, and it gives us a project ID, as the active project. How can I assist you further with this project? 
For my next message I'm typing "create a new notebook called eda.ipynb". The .ipynb is the file extension of a notebook file, so add that. I'm calling my notebook eda, for exploratory data analysis, which is once again the example that I'm going to use. Let's click the send message button and see what happens. You can see the Notable plugin is working. It says "Using Notable" and the little wheel is spinning, so there's some work happening behind the scenes. If we open this message, we can see the actual code that was sent to Notable so that Notable knows what to do. And we get the message back: "I've created a new notebook named eda.ipynb for you. You can access and follow along with the notebook using the following link." Before we click on this link, let's go back. On the My First Project page we now have a notebook called eda.ipynb, so that notebook was generated. I can either open it here or follow the link that ChatGPT gave us. And there is our blank notebook.

I suppose we could make a whole video tutorial about using Notable as your notebook environment. It is similar to using something like Miniconda and installing your own Python environment so that you can use JupyterLab or Jupyter Notebooks, or to using Google Colab. If you've used those before, it's just a notebook environment, and it works pretty much the same as many of them, other than the fact that there are some added extras. I would implore you to take some time and give Notable.io a try, not through ChatGPT, but just as your notebook environment. Play around with it. As I said, there are a few added extras here that you're not necessarily going to get with other notebook environments.

For my next message I'm saying "add the title", and then in quotation marks, "Data analysis for the heart disease project". 
So I'm just asking it to add a title to my notebook. If you've never seen notebooks before, the idea is to be able to mix titles, subtitles, paragraphs, images, code, and the results of the code all in one place. It's like a Word document, but much, much more powerful. So let's have a look. Again we see "Using Notable": the plugin is working. It's generating some code that it sends to Notable, and it's now working on your notebook on your behalf. It says it has added the title, so let's go back and have a look. And there it is: it has added a Markdown cell.

That's what we call these: cells. See these little plus signs? If I hover over them, I can add a new cell above the current cell or below the current cell. The notebook started with a cell right at the top, and we haven't used it, because we only asked for a title. You can see the title text is quite large. This is formatted text, and the language used for formatting the non-code parts of a notebook is called Markdown. You can see that this cell is a Markdown cell, and if I click the cell type open, I see I can choose Python, SQL, Markdown, or create a form. Let's go to the one at the top. We see that it is a Python cell, so let's type some Python code in there, just to show you. Let's do 2 + 2. I'm going to execute it. See that little play button there? That executes the code, and that's Python code that's going to run. We see the result printed right below the cell. So this is the whole idea of cells: we have code cells, where executing the cell gives you the results, and we have normal text cells like this one. If you want to create a new cell above or below manually, you can do that. You can absolutely use this without ChatGPT.

Now I've typed "add a subtitle", and I'm going to call it "Data import". 
You can see I'm only capitalizing the D. It's just a styling choice on my side, so that when I look at the notebook later on it's nicely styled and formatted. I try to keep things neat that way. Okay, so that's been added. Let's go to the notebook and see what's happened. We see "Data import" there as the next cell, and it's clearly smaller than the previous title, which is in very large text. I just asked ChatGPT to make that a title, and it used the biggest font that it can.

So let's double-click on that. If you double-click on a text cell, a Markdown cell in other words, it shows you how that title was generated. It was generated by starting with a pound symbol, or hash symbol, and then a space. A single hash means the largest text size. If I double-click on "Data import", you see it's two hash symbols, so that's one size smaller. You can go up to six hash symbols, which gives you the sub-sub-sub-sub-subtitle, the smallest heading. If you don't put anything there, of course, that's just going to be normal text.

If you want to write your own, you can always add a new cell and choose whether it's going to be Python, SQL, or Markdown. Let's keep it Markdown, and I'm going to write "this is normal text". You can just type that there. If you want something in bold, you can put two stars before and after the text. The other way to do that is double underscores before and after the word. If you use a single one, it's italics: a single star or a single underscore on each side, and the text will be italicized. 
To run this cell I hold down Shift and hit Return or Enter. You can now see the result: this is normal text, bold is bold text, bold again, and italics. So you can absolutely add your own cells here. You can also move them around by clicking and dragging on them. See that little symbol at the top, the little row of four colon symbols? And of course we can simply delete a cell.

One of the beautiful things about this is that people can work on notebooks together. You can share these notebooks and someone can leave a comment for you. If you collaborate, you can see on the right-hand side that many people can work on the same notebook, the same data analysis, or you can share it with your colleagues. For now, though, I'm going to click on this little trash can, and that deletes the cell for me. Let me also delete this one. So we are back at the "Data import" cell, which is the last place that we left ChatGPT.

So let's go back to ChatGPT. I've got some data, a CSV (comma-separated values) spreadsheet file, on GitHub, and I've pasted the URL for that CSV file into my next prompt. I've typed "import the data from the URL", then I give it the URL, and then I say "and assign it to the variable named df". So let's do that. It's good if you have a little bit of Python knowledge; it'll help you along. Here I am specifically requiring that the imported data set, the spreadsheet file, be assigned to a computer variable that I'm calling df. It's quite common to call spreadsheet files that have been converted to data frames df. Let's have a look at the code now, and that is exactly what's happened. It said import pandas as pd. 
If you don't know how Python works: there are many libraries out there that extend the functionality of the Python language, and one of them is called pandas. pandas is great for importing and working with data. If your data is in a spreadsheet format, you can import that spreadsheet and then manipulate the data, analyze it, create plots from it, et cetera. The pandas library is one of the reasons why Python is such a success in data science. So, import pandas as pd. That's another common thing to do. If you want to use the extra functionality in this package, you have to reference the package by name when you import it this way, and pandas is six letters to type every time. (There are other ways to import; we can discuss that at a later time.) Instead we can use an abbreviation. We say "as pd", so I don't have to write pandas every time; I can just say pd.

Then we have df equals. The equals symbol in a computer language is not the equals symbol that we get in mathematics. It's an assignment operator: it assigns what is on its right to whatever is on its left. To the left of the assignment operator is the name that I asked for, df. Then it uses the read_csv function in the pandas package, so it says pd.read_csv. It passes the URL as a string object to this function, which imports the data from GitHub and assigns it to the variable df. That is a beautiful thing. I just wrote some text in ChatGPT, the data was somewhere on the web, and it has been imported into this notebook. 
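To make that import step concrete, here is a minimal sketch of what the generated cell boils down to. It is not the exact code from the notebook. To keep it runnable offline, a tiny in-memory CSV stands in for the file on GitHub; the column names, the values, and the placeholder URL in the comment are all assumptions for illustration.

```python
import io

import pandas as pd

# ASSUMPTION: a small in-memory CSV stands in for the file on GitHub.
# The column names and every value below are invented for illustration.
csv_text = """Age,Sex,Cholesterol,RestingECG,MaxHR,ExerciseAngina,HeartDisease
40,M,289,Normal,170,N,0
52,F,210,ST,150,N,1
61,M,245,LVH,128,Y,1
47,F,198,Normal,162,N,0
58,M,270,Normal,135,Y,1
"""

# pd.read_csv accepts a URL string or a file-like object. In the tutorial
# the GitHub URL is passed as a string, along the lines of
#   df = pd.read_csv("https://example.com/heart.csv")  # placeholder URL
# The = sign assigns the resulting DataFrame to the name df.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))   # the object is a pandas DataFrame
print(df.shape)   # (number of rows, number of columns)
```

Swapping the `io.StringIO(...)` argument for the real URL string is the only change needed to read the actual file.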
I've mentioned this before, but the beauty of all of this is the following. If you work in ChatGPT, sure enough, the chat that you're having will be saved. If I open the sidebar, you'll see all my previous chats. But it is much, much nicer to have this notebook generated for you from all the results. You can go back to the notebook and share it, and everyone can look at the data analysis that was done. It is such a nice thing to have a research notebook like this.

If you've seen my last video, you know that I like to tell ChatGPT about the data and the variables. I would normally give much more information than I've written here, but you can read it, and it gives ChatGPT some background knowledge about the data set. If you are the one analyzing this data, it's good that you know something about it and tell ChatGPT. ChatGPT will do a wonderful job of interpreting the results, but you can make it a lot more powerful and accurate if you tell it a little bit about the data, for instance by describing what each of the variables is. In the previous video in the series, I asked for a specific EDA on the numerical variables and the categorical variables, and we'll do that here as well. It's also good to tell ChatGPT about the encoding. Our response variable here is called heart disease, with a zero if someone does not have heart disease and a one if they do. You can tell ChatGPT about that encoding so it knows what the zero and the one mean. Usually I'd be more verbose than this, but for the sake of this tutorial my prompt is brief.

Next I'm going to add another subtitle, and I'm going to call it "Univariate summary statistics and data visualization". Let's just get that quotation mark in the right spot. There we go. 
So for my next prompt I'm typing "add a subtitle" with "Univariate summary statistics and data visualization". As I said before, I like to keep these notebooks nice and neat. You're going to use them again a couple of weeks or a couple of months later, or someone else is going to use them, and it's very nice if they are formatted well. First, you'll know what you were up to a couple of weeks ago when you can't remember the data set or the data analysis anymore. Second, if you've shared it with someone, they'll know what you were trying to achieve. And lastly, of course, we'd like our work to look neat.

Let's go back to the notebook and see where we stand. There we go: we've got our "Data analysis for the heart disease project" title, "Data import" as a subtitle, and the new subtitle "Univariate summary statistics and data visualization". Our data was imported there. Usually I would do a bit of a sanity check on the data set when it is imported, so let's do that just for fun. Our data frame is called df, so I type df.info(). Here df is a variable name, a computer variable name, that creates a little space in your computer's memory, and to that space is assigned an object. This object is a pandas DataFrame, a representation of that spreadsheet file. Now I'm going to call a method, which is a function applied to an object that already exists. The df data frame exists, so I say .info(), and it uses this data frame and gives us back some information about it. Let's execute that, still under the data import heading, and we get some information about the data. 
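The sanity check can be sketched as follows. This is only an illustration under stated assumptions: the small in-memory CSV, its column names, and its values are invented stand-ins for the real file, so the counts printed here will not match the ones in the video.

```python
import io

import pandas as pd

# ASSUMPTION: invented stand-in data with hypothetical column names.
csv_text = """Age,Sex,Cholesterol,RestingECG,MaxHR,ExerciseAngina,HeartDisease
40,M,289,Normal,170,N,0
52,F,210,ST,150,N,1
61,M,245,LVH,128,Y,1
47,F,198,Normal,162,N,0
58,M,270,Normal,135,Y,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# .info() is a method: a function called on an object that already exists.
# It prints the column names, the non-null count per column, and the dtypes.
df.info()

# A more explicit check for missing values, column by column.
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing values:", total_missing)
```

Whole-number columns such as the age are read as 64-bit integers, while text columns are reported as object (or as a dedicated string type in newer pandas versions).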
We see all the variables, the column headers: age, binary sex, cholesterol, resting ECG, max heart rate, exercise angina, and heart disease. Under the non-null count we see 918 non-null values, so there are no missing values. You can also see what Python thinks the data types are. It thinks that age is int64, a 64-bit integer, and integers are whole numbers. Object is Python-speak for the text values that are in that column, which here means a categorical variable. Then int64 again, object again, so you can see what kind of data Python thinks each column holds at the moment.

Let's add one more. I'm going to say df.head(), so I'm calling the head method on the df object. By default, head gives us a nice little table of the first five rows of data. Python is zero-indexed, so it always starts counting at 0, and the first observation here is observation 0. This person was 40 years old and M, for male (a binary gender is used here for simplicity), and the cholesterol value was 289 mg per deciliter. You can scroll along and see all of those first five rows of data. You also get a nice little graphic, a bit of a plot, to give you some idea of the data. Binary sex is a binary categorical variable with two categories, and we see little bar plots giving you some indication of that.

I said that the Notable environment adds a little bit more, and one of those things is this little Visualize button, which you're not going to get in Jupyter Notebook or in Google Colab. If you click on it, it opens up a builder where you can filter the data, sample from it, and build all sorts of different plots just by clicking on things. That will generate the Python code and create the plot for you. That's not what we're going to do now, so back to ChatGPT. Next up I'm asking for a sub-subtitle, and what ChatGPT is going to do is send a message to Notable 
to use three pound symbols, or three hash symbols, in that text, so we're going to the third-largest font size. Let's do that and I'll show you what it looks like. Let's go to the notebook, and you see there that it's even smaller: "Age analysis". What I'd normally do is run through all of the variables, each and every column, and do a bit of summary statistics and data visualization for each one. It's univariate: I'm only looking at age, then only at binary sex, then only at cholesterol. I'm trying to start teasing out the knowledge, the information, that's inside the data set. I'm trying to understand what's going on here, and one good way to start is to look at the individual variables. That's why we're calling it univariate summary statistics.

Back to ChatGPT. Next up I'm asking it to calculate summary statistics of the age column, including the count, mean, median, variance, standard deviation, minimum, maximum, range, quartiles, and the interquartile range. I'm trying to be specific here in telling it which summary statistics I want, but you could also just leave it at "summary statistics" and it will decide which ones are most appropriate. That's the way the large language model works. There we can see the results; it's all there. But the beauty of all of this is that it's now captured inside our notebook, and it lives there for later use.

Let's go have a quick look. There you can see the code that was generated. You can see df, and then in square brackets we're selecting the age column. That's a new object now, called a pandas Series object, by the way. We call the .describe() method on it, which computes a bunch of summary statistics, and then the code adds a couple more. Then we have a nice little printout: this is actually a Python dictionary object that's used to produce the printout for us, and we can see the count, the mean, the 
median, the variance, standard deviation, minimum, maximum, range, first quartile, median (or second quartile), third quartile, and the interquartile range. You can start to see that this notebook is coming together quite nicely as a research document.

Next up I'm going to say: "Create a histogram with a kernel density estimate of the age column. Add the title Age distribution. Add the horizontal axis title Age. Add the vertical axis label Count. And add a grid to the plot." Let's see what is generated. It's generated our plot, so let's go have a look. There's the code. Sometimes it will actually execute the code and sometimes it won't; if it doesn't, remember you can always click the little play button and that block of code will be executed. And look at that lovely plot. We see a nice histogram with a grid at the back. I like the grid, but you can obviously do whatever you like. We see the title, the axis labels Age and Count, and the kernel density estimate, the continuous curve that was drawn to show the distribution of the age column.

And there's all our Python code. What a wonderful way to learn Python. If you know a little bit of Python and you want to extend your knowledge at quite a rapid pace, let ChatGPT generate the code for you inside a Notable notebook and you can see how all of this is done. It used the seaborn library, which uses matplotlib as its back end, as it's called, and there's matplotlib too. Seaborn just makes life a little easier: it takes less code to generate these plots, but it sits on top of matplotlib, which does all the hard work. matplotlib is a magnificently large and complicated plotting package in Python. It's another one of the true success stories, one of the reasons why Python has become the de facto language for data science, because it is a behemoth. There's almost nothing that you cannot plot with matplotlib. But because it's so big and so complex, because it allows you to 
do so much, it's sometimes a little difficult to work with. I do have a video on this channel that shows you in 90 minutes how to get on top of matplotlib, so you can search for that. As I said, seaborn sits on top of it and simplifies the code a little.

There we can see that we called the figure function from pyplot. It's the pyplot module inside the matplotlib library that has this function called figure, and we're setting the size of the figure. sns is the abbreviation we're using for seaborn, and it has a function called histplot, histogram plot. We're plotting a pandas Series, which is the age column. We set kde, kernel density estimate, to True, and it arbitrarily decided that there should be 20 bins. You can absolutely control that right inside ChatGPT: you can tell it to start at 20, end at 100, and use intervals of 10. You can state that in words and it'll generate the code for you. We see plt.title, plt.xlabel, and plt.ylabel, and we notice that their arguments have to be passed as strings, inside quotation marks; you can use double or single quotation marks. We see plt.grid set to True, and finally plt.show, which actually renders the plot for us. What a beautiful way for us to learn how to use Python.

Now back to ChatGPT. Certainly watch the first video in the series, where I showed you a few more things to do as far as EDA is concerned. I'm going to add a little extra bit to this video, just to entice you. I'm asking it to add the subtitle "Statistical analysis" to the notebook, and then also the sub-subtitle "ECG diagnosis grouped by heart disease". So, as a little bonus at the end of this video, let's do a bit of statistical analysis instead of just the EDA. My subtitle and sub-subtitle have been added, and I'm saying: create a contingency table of expected values using the heart disease and the 
resting ECG columns. Let's do that. And there we go: there's a contingency table of expected values. Heart disease 0 and 1, that's no heart disease and heart disease, run along the two rows. Along the columns we have the resting ECG variable, which is a multi-level categorical variable with three levels, or three classes: left ventricular hypertrophy, normal ECGs, and ST segment changes. We see a table of expected values under the null hypothesis that these two categorical variables are independent of each other, that is, not associated with each other. If you're taking a course in biostatistics, specifically with us at GWU at the Milken Institute School of Public Health, where we have our postgraduate course PUBH 6002 for biostatistics, you'll learn all about these tests.

If you want to see whether two categorical variables are associated with each other, a test that you can do is Pearson's chi-square test for independence, also called Pearson's chi-square test or the chi-square test for association. Many names for the same thing; that's what we do in statistics, we give the same thing many, many names. One of the assumptions for that test is that the expected values must be at least 5, and we can see that 49.7, 83.3, 50.2, and 84.6 are well above 5. So that assumption for the use of Pearson's chi-square test is met.

Lastly, let's create a prompt where we actually do the chi-square test. I'm going to say: "Perform a chi-square test for independence", or you can give it any of its other names, "to determine if the heart disease variable is independent of the resting ECG variable. Use a 5% level of significance." Let's add a little bit more: "Write a comment", now you can actually see me type slowly, "about the results." There we go. We see an X² statistic of 10.02. That is quite big, and hence our p value is much, much smaller than our level of significance. Remember that this 
distribution of X² values will follow a chi-square distribution with a certain number of degrees of freedom. Remember how to calculate that: it is the number of classes in the one variable minus 1, times the number of classes in the other variable minus 1. In this instance, 2 minus 1 is 1, 3 minus 1 is 2, and 1 times 2 is 2, so we get a chi-square distribution with 2 degrees of freedom. We are willing to make a type I error 5% of the time, and we see that a value of 10.02 or more extreme will occur under that distribution with a probability of 0.0067, a proportion with a maximum of 1. In other words, there is a very low probability of that happening: a small p value. And the comment is written there: the variables heart disease and resting ECG are dependent, suggesting that there is a significant association between the ECG diagnosis and the presence or absence of heart disease.

So let's go back to our notebook, where we can see how all of this was done. Here is how the contingency table was generated. Let's have a quick look at that; I'll just run it, and there is our table of expected values. And here is how the chi-square test was done. There is a module in the SciPy package, SciPy for scientific Python, called the stats module, and inside that stats module there is a function called chi2_contingency that performs the chi-square test for us. With the crosstab function we have created a table of observed values, in other words a contingency table, from two pandas Series objects: the column with heart disease and the column with resting ECG. Then we call the chi2_contingency function and pass this contingency table to it. It returns a tuple of values. First is chi2, our X² statistic, then the p value, and it returns two other values as well. We just put underscores there because we are not interested in them. Then we set an alpha value of 0.05 and write out this text, which uses all of those results. If we run that, we 
see there is our X² statistic, there is our p value, and then the conclusion: these variables are dependent. So have a look at this notebook. It is a wonderful research document, and it was all generated without us lifting any coding fingers. In other words, we just used ChatGPT and wrote a bunch of prompts, and it generated all the results for us and also populated this notebook. This is a wonderful research document that we can now share with our team, there's a little share button up there, and we can come back to it and continue our analysis.

If you know Python, of course, you need not use ChatGPT. You can go in between any of these cells and write your own text. If you wanted to leave a comment, say here, it is very easy to add another cell. Let's make it a Markdown cell, and you can say something about this result: it seems like there is a bit of a left tail there. You can leave behind your thoughts on the results that you see here. Or right at the bottom we can add another Markdown cell and say there is an association between these two variables, which might lead us to do some further analysis. So, wonderful. The Notable plugin for ChatGPT is absolutely fantastic, and I urge you, as I mentioned before, to give Notable a look as your coding environment as well.

Now, as always, think about getting one of the paid plans, because one thing you don't want to do is put any sensitive information out there. Obviously don't do it with ChatGPT either. You must have permission to put the data into open tools like this, tools where the data is going to be available to, or scrutinized by, a third party, which is not always legally the right thing for you to do. So make sure that you don't do that. With some of the paid plans you can of course safeguard against that, but please, if you are working with sensitive data, 
please make sure that you are allowed to use that data in tools like this. Of course, it is a wonderful tool when you just use simulated data or open data. There is so much open healthcare data out there today, and if you use that, you should not be too concerned about using tools such as ChatGPT and Notable, especially if it is simulated data.

So how was that tutorial for you? I hope you found it at least interesting and can see all the potential that there is. If you have never worked with notebooks, now is the time to start looking at them. What you also saw is that ChatGPT was generating Python code. If you know nothing about Python, it's a brilliant way of learning how to use it, because ChatGPT with the Notable plugin generates the Python code for you, and you can see what it does and how Python works. It is a beautiful learning environment. Even if you're not interested in the Python, at least you have a very nice research document. As I said, you can go look at that document next week, and if you formatted it nicely and added some nice text to it, you'll remember the kind of analysis that you were trying to do. If you commented on the results, you've got all that there. And if you also want to share with your colleagues, you can share that notebook and lots of people can work on it. As I mentioned, just be careful of putting data on there that you're not supposed to share online.
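As a recap of the two computations from the walkthrough, the univariate summary of the age column and the chi-square test for independence, here is a minimal sketch. It is not the code from the notebook, and the data are synthetic: the counts, the ages, and the column names (`Age`, `HeartDisease`, `RestingECG`) are assumptions made up for illustration, so the numbers it prints will differ from those in the video.

```python
import pandas as pd
from scipy import stats

# ASSUMPTION: synthetic counts, NOT the tutorial's data set.
# Keys are (heart disease code, resting ECG class); names are hypothetical.
counts = {
    (0, "LVH"): 8, (0, "Normal"): 27, (0, "ST"): 6,
    (1, "LVH"): 11, (1, "Normal"): 28, (1, "ST"): 12,
}
rows = [
    {"HeartDisease": hd, "RestingECG": ecg}
    for (hd, ecg), n in counts.items()
    for _ in range(n)
]
df = pd.DataFrame(rows)
df["Age"] = [40 + (i % 30) for i in range(len(df))]  # made-up ages, 40 to 69

# Univariate summary statistics for one numerical column.
age = df["Age"]  # selecting one column gives a pandas Series
q1, q3 = age.quantile(0.25), age.quantile(0.75)
summary = {
    "count": int(age.count()),
    "mean": age.mean(),
    "median": age.median(),
    "variance": age.var(),
    "standard deviation": age.std(),
    "minimum": age.min(),
    "maximum": age.max(),
    "range": age.max() - age.min(),
    "first quartile": q1,
    "third quartile": q3,
    "interquartile range": q3 - q1,
}
for name, value in summary.items():
    print(f"{name}: {float(value):.2f}")

# Contingency table of observed counts: 2 rows by 3 columns.
observed = pd.crosstab(df["HeartDisease"], df["RestingECG"])

# chi2_contingency returns the statistic, the p value, the degrees of
# freedom, and the expected counts under independence.
chi2, p, dof, expected = stats.chi2_contingency(observed)
expected_table = pd.DataFrame(
    expected, index=observed.index, columns=observed.columns
)
print(expected_table.round(1))

# Assumption check: every expected count should be at least 5.
assumption_met = bool((expected >= 5).all())

alpha = 0.05  # 5% level of significance
print(f"X2 = {chi2:.2f}, p = {p:.4f}, degrees of freedom = {dof}")
if p < alpha:
    print("Reject the null hypothesis: the variables appear associated.")
else:
    print("Fail to reject the null hypothesis of independence.")
```

The degrees of freedom come out as (2 − 1) × (3 − 1) = 2, matching the hand calculation in the walkthrough. The notebook discarded the last two return values with underscores; they are kept here so the expected counts can be checked against the at-least-5 assumption.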