 It's Sunday and I'm working at the University of Cape Town campus. We're going to have a cabinet meeting in South Africa today. I'm sure we're going to find out that there's going to be severe restrictions and I'm sure the university is going to make plans to certainly close many of the campuses. What I want to talk to you about is an introduction to two videos. I'm going to use the same introduction for two videos and that's on the data analysis of open data that's available for the coronavirus. Now the Wolfram repository data repository has three datasets that's available. It comes from the WHO and it's updated very regularly. Just mind that when you do the data analysis it's going to look slightly different from the data analysis that I'm going to do because it is regularly updated. It does lag a few days behind so you're going to see data up till a couple of days ago. So we're going to do two videos. I'm going to show you data analysis one on the medical data that the WHO has made available and one on the epidemiological data. So as I said the same introduction video is going to be for both. So I do have courses on the use of the Wolfram language for data analysis specifically for the health care related field. The Wolfram language of course lends itself beautifully to data analysis. And then the data repository has a lot of open source data curated by the Wolfram research so that we can use the data in our analysis. So there's a lot of data analysis videos. Wolfram themselves make some but I want to give you just something from the perspective of a doctor, from a surgeon working in a hospital where we are preparing for the management of these patients. So have a look at these videos. As I said one is going to be on the epidemiological data and one is going to be on the medical data and the use of the data analysis. If you want to learn more about the use of the Wolfram language of course I do have courses. You can get a University of Cape Town certificate through Coursera if you take the course on the use of the Wolfram language for biostatistics. So let's go in and explore this data set. Some of the data is sparse. It's difficult I think for the world to gather this type of data in as much as a worldwide effort to try and collect this data. So some of the data is going to be a bit sparse specifically when it comes to data. For instance about the symptomatology, the symptoms of the patients but it is a large data set and we can still learn a lot from it and that's what all this is about. When I teach healthcare statistics, data science, data analysis I really urge people to get to the story that the data is trying to tell you. There's information lock in that and that information is knowledge and that is exactly what we need right now. So let's analyse some of this data. I'll show you how to use the Wolfram language to do this. So let's go have a look. So here we are in Mathematica. I'm using desktop version here but you can definitely open the cloud platform on the web and do all of this free of charge. The first place to just go and find out a little bit more about the coronavirus is of course in Wolfram Alpha there's some curated data here and if you simply pass coronavirus as a string to the Wolfram Alpha function you are going to see a bit of information and you can also ask to have it interpreted in a certain way and you're going to find either some medical information or some information on the virus itself. So have a look at Wolfram Alpha. Of course we are interested here in the curated data repositories that are available for the coronavirus on the Wolfram server. So there are three data sets. I'm going to use the resource search function and just pass the string coronavirus to it and it's going to look through the Wolfram data repository and find all the data sources that is available on the repository and we see they are three, the epidemic data for the novel coronavirus, genetic sequencing if you are interested in that and then the patient medical data. Those are the three data sets that are available that are available on the Wolfram server and we can certainly look at all of them. So here we are. I'm using Mathematica on the desktop and you can certainly use... So here we are on the Wolfram desktop but you can certainly log on through your web browser and do all of this online and also free of charge. But I'm using the Wolfram desktop here just running this all on my own computer system. So the first way that we can find out something about the coronavirus is just of course to use Wolfram Alpha and I think most people are familiar with Wolfram Alpha and I can just simply pass coronavirus as a string to the Wolfram Alpha function here inside of the Wolfram desktop and that's going to give me some information about the virus. I can also ask it to be interpreted as medical data and if you have a look at that, you can certainly look through that. What we are interested in though is the data sources that are available in the Wolfram data repository. There are three of them as we can see here. So I've just passed coronavirus as a string to the resource search function and we can see that there are three data sets that are available. We're going to look at two of them, the epidemic data and in this video we're going to look at the patient medical data and there's also of course as you can see in the middle there one for the genetic sequencing if you are interested in that. So this is curated data which means we can use it right inside of the Wolfram language but you can certainly of course just import a data set that is captured as a CSV or another data file. This data comes from the World Health Organization through Johns Hopkins as far as I'm aware and it is updated regularly but it is not up to date as far as a specific data is concerned. It's also important that you know that this video is recorded on a certain day and when you look at this video there's going to be newer information added to this. So it does lag behind as far as the real world data is concerned but at least we have fairly up to date data when it comes to these three data sets. So let's start immediately. We're going to do importing of the data set, the medical data set for this video as I mentioned. I'm going to use a computer variable called medical data. I'll highlight it there for you, medical data. I'm going to use the resource data function and then I'm just simply going to pass this data name that we can see up here. Patient medical data for the novel coronavirus COVID-19. So I just pass that as a string and that is going to import that data set from the Wolfram data repository. And there we go. It is saved as a data set as you can see here and we can see it's not going to fit on the screen. There's certainly a lot of data here. We can see it starts with age, the sex, city, administrative division which would be the province or state or region where the patient is from, the country, geoposition and as we move across there's some data on set of symptoms and not all of those data points have been captured and that is what you're going to find with big open data sets like this. A lot of people on the ground trying to collect this data it is not always going to be captured. So there's not going to be granularity in the data. In other words there are not too many variables here so we're unfortunately not going to get all the information that we want but there is still a lot here. There's a lot of empty data cells here especially when we move across and we look at the reported market exposure, chronic disease. Chronic disease would be a very important one and this empty cell does not mean that the patient doesn't have chronic disease just that that data wasn't captured at all. Especially when it also comes to death the data sets do lag behind when it comes to the capturing unfortunately of mortality rates so we're not going to have up-to-date information on the mortality rates in open data sets like this. So in some cases this is a sparse data set and there are other problems as well and we're going to deal with them as they come up. One that we can see right at the bottom here immediately is that the symptom list here is not dummified and by that I mean there is not a column for fever and a column for pneumonia and a column for cough and just yes or no as the sample space elements for that variable. They are captured as list elements inside of the data set and we certainly will have to extract them if we want any information on that. So given that it is a sparse data set it's still there's a lot of information in here and certainly from a medical point of view I find it quite useful and a lot of information that we can get from this. As always with a data set I want to know what variables they are. We can scroll across the screen and see all of them but let's just extract them as a list. Remember if I want to extract something as a list I'm going to pass whatever I want as a list to the normal function. The normal function is going to return a list for me and because this is a data set we're going to use data set notation so I use the name of the data set, medical data that's the computer variable name into which we've captured our information and I'm going to use indexing the first row that we have and I want the keys. The keys are the column headers. Those are the variables, the statistical variables and you can see all of them. We've seen the first couple of ones age, six, city, etc. and in the end they're data of discharge, data of death, death queue. That's typical of Mathematica or Wolfram language. Queue meaning is the so to a false, a boolean value and as I said unfortunately that data is not up to date as the data set becomes available on a daily basis. Also we might just be interested in some of the information about the data set itself, where it comes from, etc. So I'm just going to pass this to a computer variable called met data resource and I'm using the resource objects function there just passing the name of the data set to that. This is now an object and we can look at all the properties that are available for this object. So we can see auto update categories, content, element locations, content elements, content size, content type, contributed by, contributed information, etc. Now what I always like to do when I have a lot of properties like this is just wrap it inside of a manipulate function, fantastic function inside of the Wolfram language which means I can just go through all of these one by one or select the ones that I'm interested in and I don't have to type it all out. So I pass it to the manipulate function and we're going to say met data resource and then use a variable there in and we're going to iterate over that in what I want there as a list of all of these properties. So I just pass that as properties and if I run that, very useful, you can see all of those properties as a dropdown list now. So I can just see it as a dropdown list and get that information if indeed it is available, latest update there. At the time of this recording, you can see it was the 10th of March. Now I'm not recording this on the 10th of March, it's certainly a week or so later already but that is the latest information that is available as an open data set by the WHO. Another thing just to be aware of is the fact that we might see strings here like China for the country but inside of the data set it is not a string that is captured, it's an entity and there are a couple of ways that we can extract entities. One is through the interpreter function and the interpreter needs to know what kind of word you are passing to it so it wouldn't know what United States is. I'm telling it it is a country or please at least of course the Wolfram language is going to know that this is a country but I want the interpretation of it to be a country and then I'm just going to store that in a computer variable USA and when we run that code we see that we have an entity, United States and that is how it is captured in the data set so when I want to search the data set I must search with this entity not with a string United States so let's just start doing something useful with this data I'm interested in the first column there the age most definitely so let's have a look at that the first thing that I might want to do is just some descriptive statistics and a good way to go about that is just data visualization it's a numerical variable continuous numerical variable although these are captured of course but a histogram would be excellent for that first I want to extract all the values as a list so I'm going to wrap that up into the normal function and what you're going to see throughout here is quite long lines of code here and it's sometimes difficult if you're not used to the Wolfram language to decode what is going on here always just start in the middle so there's medical data which means I'm using data set notation I'm saying via indexing give me all the values down the age column so take the medical data data set give me all the values down the age column but wrap that inside of the normal function which means give it back to me as a list object but I'm just warning you that not all the ages were captured as integers some of them were just captured as intervals and it just says age 80 plus and that's certainly not a number that I can do any kind of statistical analysis within I can't visualize that certainly not in a histogram so I'm wrapping all of this up inside of the cases function and the last argument here is underscore integer so only include this if that data is an integer and this is a very easy nice way in the Wolfram language just to clean up the data right in one short line of code there so I've stored all of that in the all ages variable and now we can do a histogram so what I'm asking here is histogram of all the ages I know it's just a list of numbers now there's not going to be any problems I'm giving it a nice plot label I'm putting access labels on remember it's always first the x-axis and then the y-axis I'm putting age on the x-axis count on the y-axis and as always I like large images I like to be able to see what's going on so image size is large and there we can see the distribution of all the patients in this data set and as we know already now there's very few younger people infected which is of course from a medical point of view you're going to probably hear me say medical point of view quite often I am a surgeon after all so I do see this data set this data analysis and use of the Wolfram language here not only from the eye of a data analyst or a biostatistician or someone just interested in the data computer scientist I'm also seeing it as a doctor so good to see that young people seem to be spared and we hope that that certainly continues now let's just have a look at the next column that was the gender column captured as 6 there so again look at this we're using data set notation so I paste the name of the data set and then the column all the row, column so the row says all the values down the 6 column but what I want to just find out here is how was the data captured what are the sample space elements in this categorical this nominal categorical variable so I'm going to say delete duplicates so if I wrap all of that up in the delete duplicates I'm only going to see the sample space elements so let's run that and we see it was a binary classification male, female and then they're also empty cells in some cases the gender was not captured so this is definitely a description of seeing gender as a binary nominal categorical variable so once again as with the countries these are not captured as strings inside of the Wolfram data set it is captured as entities so I just have to interpret it as entities so I've just made two computer variables male and female and I've passed the instead of using interpreter what I've done is just the following let me just add a line of code there so you can see what I'm going to do click on the little plus there and say free form input and then just type male or female and that's going to bring up these little text text boxes you can just say yes that I do want these interpreted as a gender and of course we could have used other ways just to capture these as entities so if I just use female there note then it is captured as an entity not as a string and that's certainly what I want to go and do because I'm interested to see it say an age difference between males and females and I'm going to capture that as a paired histogram but first I've got to sort my data set by gender and what I'm actually going to do is my plan here is to extract two separate list objects one for female ages and one for male ages and how should I go about it how do I construct this line of code very simply we go down the middle here we see I'm using data set notation medical data and comma because I'm not using the full notation when I'm looking for rows and columns what I'm doing is passing it as an argument to the select function so that is actually my inside is the select function select from the medical data data set now if I want to go down a row down a column row by row and I want to look for specific entries this is how we do it this is how we set up the Boolean question that we're asking so always the hashtag in front of the name of that column variable column header name statistical variable so it's this hashtag symbol six equals equals female and remember female is now that entity and always the ampersand to say go down that whole column row by row so it says select from the medical data set go down the six column and only capture where it says female and then I'm going to have a little data set aren't I if I just had that part it's going to return to me a whole data set but only where female was indicated but I don't want that I only want age so I'm going to use indexing now so after that you see the indexing the two semicolons mean all all the rows comma the age column so I only want the age values returned to me and then remember as we said that there is a bit of a problem in as much as not everything has been captured as integers some of them are just text with intervals in them so I'm going to wrap all of this inside of cases so cases is just the integer values please and all of that I wrap inside of the normal function so I certainly could have done cases and normal the other way around that wouldn't have mattered and then I'm going to get the same thing I'm going to have returned to me a list object that only has integer values in it and then I'm going to do the same for males and if we do that we have female age and male age and now we can do the paired histogram and look how I've set up the paired histogram I want it to be labeled so I have two labeled functions they're labeled and there's another one and the first one we're going to have the female age list a string name for it female and I want that on the left side of my paired histogram and then the male age with the name male but I want that plotted on the right-hand side I add a plot label as well and as always image size is large so now we're going to be able to see the difference between female and male as histogram and we can certainly see there's a lot more males here in this data set it seems to be affecting more male patients and we can see the gender difference right here on the very young side we see unfortunately there's a few more females so again as a clinician, as a doctor I'm concerned about the symptoms so let's go have a look at the symptoms now as I said it is a sparse data set as far as the symptoms are concerned and they weren't all catch the data was not captured as dummified variables there's not one for cough, one for fever, etc they just all passed and certainly when I teach data science or biostatistics or machine learning deep neural networks when I teach this is very important how we set up the data capture and not have data captured this way now it certainly makes sense when this data is captured it's easier to capture it this way but when it comes to analysis of course it makes life difficult for us so what I've decided to do here is if we go right in the middle in the data set here and indexing so it's the data set, medical data I want all the rows, the symptoms column if I were to only do that if I were to only do that it's going to give me this hodgepodge of empty cells cells with a single symptom in them and then cells where there's this list of symptoms inside of them so I want this extracted as a list object first of all so I'm wrapping that all in the flatten function I'm wrapping all of that in the delete missing function because I just don't want all those empty cells in there then I want the counts of all of those so I want to know how many fevers were there how many coughs were there so that's all inside of the counts function and all of that I want sorted and I want sorted greater first so in descending order so just start in the middle and see how if you see a long line of code like this just deconstruct it and when you want to construct it yourself don't start writing the sort first you're going to get mixed up I get mixed up so I start right here in the middle I specifically started typing medical data all symptoms and then saw what the output of that was I then wrapped it in the normal function and then I saw what the output of that was then to get rid of the missing because I saw the missing values there got rid of the missing and then well I'm interested in all of that please as accounts I know how many times each symptom occurs and then not just a hodgepodge of them I want this in a specific order so that is just the way to construct these and there we can see fever, cough now this data needs a lot more cleaning up because if I look down here there is 38.1 degree Celsius and there's one of those that's certainly fever that data should have been captured as fever there's another one, 39.1 that should have been added to fever but I think from a clinical point of view there's still enough data here for us to get some sort of idea of what the most common symptoms are what are the symptoms that we have to look out for when we see patients like this and a good way to visualise this data of course is a word cloud but I'm not interested in all these small little ones so I think a cut-off of about 25 most common ones would be great for word cloud so let's have a look at this as a word cloud and there we can see fever, cough there's respiratory symptoms etc certainly we can clean up this data a bit more and we'll get a better idea but this word cloud I think from a clinical point of view is already telling us what we have to look out for when it comes to these patients very important from a clinical point of view is also the timelines and there's a lot that we can get from these now I'm going to break it up into a couple of countries because there's certainly some interesting information here first of all let's start with a big country like the United States now remember I've already saved United States as an entity in the computer available USA so let's just construct a data set that only has data in from the United States and the way that we're going to do that of course is with a select function now I could just rearrange the syntax here you've seen me use the syntax the other way around this time I'm doing the medical data, the data set name first then the select function and I'm going to go down the country column equals equals USA and ampersand which means go down that whole column but certainly I could have had the select outside remember we've used that before irrespective of that I'm going to have a new sub data set data set object here and if we look down the country we're only going to find the United States so let's do list the line plot of the dates of confirmation now those are the street values these dates it was captured at a certain date every case but I'm using USA medical and then the row, column give me all the values down the date of confirmation column and there we see this increase a few cases a few cases and then this meteoric rise here of the number of cases unfortunately and as we can see here data here was captured on the 9th of March for the USA here but we can certainly see this ramp up this exponential growth of the number of cases unfortunately now let's do the following I want a timeline plot I hope I didn't say listline plot I'm sure I did, I mean timeline plot of course this is time series data so I want a timeline plot of USA medical and this is what I want from the date all of this, the date of onset of symptoms the date of admission to hospital and the date of confirmation so let's have a look at that and what I've done here is for the second patient so if we go up and we look at list number patient number 2 a 35 year old male here in Washington in the United States the date of confirmation was on the second on Thursday the 16th of January date of admission 19 and date of confirmation on the 20th so if I pass all of that so this 2 refers to that row number 2 so I'm using indexing here so row number 2 and the columns that I'm interested in and what we're going to get back from a timeline plot like this is that we get this text here so we can clearly see what has happened to this patient the date of onset of symptoms there on January and admitted to the hospital the 19th of January and then confirmed to have that on the 20th so important for us if we want to draw down to a specific patient now let's do the following I'm just going to extract the data for all the patients for which we know the following if the age and the gender and the onset of symptoms and death if all of those have been extracted I want that to put that inside of a new data set so I've just chosen these might be of interest to me as a medical doctor or if you're analyzing this data might be of interest to you so let's just see how we set that up because this is an important thing to be able to do when you want to extract data from a data set so let's see how we put that together so I'm going to use the notation that we did when we created the USA data set I'm going to start with medical data as my data set I'm going to say select from that the following and in that select function I'm going to break it up by my by the parentheses that you can see here it starts right here I hope you can see I'm highlighted there and it ends right there I'm going to use my Boolean logic here but I'm going to put the and in between and the and is these two ampersands there so all of these things must be true before that row of data is included in my new data set called symptom to symptom to death timeline TL that's my computer variable that I'm doing so I'm using the select function and now look what I've put in here I'm saying not missing Q so missing Q is going to return 2 if it is missing but that's not the ones I wanted so I'm saying not missing Q the age and missing Q 6 and not missing Q data of onset of symptoms and not missing Q date of death so I'm interested here in only looking at the unfortunate patients who succumb to the disease so I'm looking at the mortality here and then I'm going to just name my columns age 6 date of onset of symptoms and date of death so if we run that code we're going to have a new data set that only contains that information so it's only going to have information on the patients who unfortunately died and this is very important information because we can use this in the form of prediction so just as we did here for one patient we can have a look at a couple of them and we can set that up as a grid and we're going to partition that so this is a long function and you can really just take some time and deconstruct what is going to happen here I'm certainly have a plot range in here I'm starting at about the 20th of January and I'm going into the 20th of March I've added a string template because I want at the underneath every plot I want some of the patient's details and what I want is age and then years old and then sex it's going to say 85 year old female for instance and I'm putting that at the bottom and then I'm just taking that to each of these I'm taking the normal of the sort of the symptoms the first 10 so 10 meaning I just want the first 10 cases and all of that is inside of a partition and inside of a grid as I say take some time and just deconstruct this and you'll see that we have the first 10 patients here 36 year old females have come and you can see the date of onset of her symptoms and the date of death there so very close to each other and I should say 48 year old female the 53 year old male etc and we certainly very important information from a medical point of view unfortunate information but very important information so if other interests might be the time difference between the onset of symptoms and conformation let me correct that spelling mistake right there conformation now remember there are many factors that play here the onset of symptoms to the conformation because it might just be what the health services are like we know in the United States the testing kits are in short supply so but it still is very important information and something that we can really use to act upon so I'm going to save this in a new dataset and I'm going to call it different symptom conformation and start with a medical dataset I'm going to select on that again all these things that I want to be included so if I do all of that and then I'm going to make this column that is a date difference and the date difference is I'm looking at two columns and it's going to calculate for me the difference between those two dates so those are the two columns run down the full length of this new dataset for me and we can see that it's returned this dataset object for me so many days, 4 days, 3 days, 0 days 17 days etc let's just look at how many there are that we actually have this information on and we have information on 839 patients so let's just plot this but I want to plot it in a certain way let's run this so you can see just what the histogram is going to be like but before we do that histogram we just have to make note of one or two things the dates have not always been captured correctly so unfortunately you're going to find negative days here that the patient was confirmed before, symptoms now certainly that does occur certainly locally we've had one confirmed case that a patient is a totally asymptomatic young patient totally asymptomatic but there was confirmation of the disease so you can get these negative dates but in some instances the data is just incorrectly captured and between those two let's just get rid of all of those values and we're just looking at the positive ones so we're going to select and what we want is this quantity magnitude of the difference we only want the ones larger than zero so select, use the select function, use the quantity magnitude function and then inside of the quantity magnitude function we have our data set here only where the numbers are larger than zero and we're going to go through all of them there with ampersand sign and we pass that to the histogram and we can see the days between onset of symptoms and confirmation of diagnosis and in some cases it is absolutely quite long now is there a difference between the onset of symptoms and confirmation between the binary genus here that might be information good information do females take longer before they go for the diagnosis are there some social reasons for that is there a difference at all first of all let's have a look at that so from my data set I'm just going to select the female cases and from the female cases I only want the two columns data onset of symptoms and data of confirmation and I'm going to create two new data sets there female dates and male dates and by now I think you know how the select function how I've been using the select function now I want to set up the difference between the two so I'm going to use quantity magnitude I'm wrapping that inside of the normal function of the female dates we're going to select only the cases where those two dates are available and then we're going to introduce this date difference where I look at the difference between those two dates so certainly two long line of code there but it's very easy to deconstruct so I've saved those two and let's do a box and whisker plot and of course there I've already had a look at it where the differences are and the negative what's on the male side so I'm just asking with a select function there only include for me the ones that are larger than zero and there we go we see there isn't that bigger difference here between the two we've got females on the left-hand side females on the right-hand side quite a few outliers so there's definitely not from a population in which this available is normally distributed we can't use a parametric test so I'm going to use a non-parametric test to compare these it's two lists of values that I have so I'm going to use the man with you test I'm expecting no difference between the two and I want a test data table to be returned to me and we can see a p-value there of 0.686 so definitely not a statistically significant difference between men and women as far as the difference in the onset of the symptoms to the confirmation of the diagnosis is concerned so important information that we can get there geographical data also be important so let's do the following we're going to take the medical data data set just look at the first five rows and look at the geo position let's just have a look at that just to mine ourselves that's what it looks like so we see 70 degrees north 70 degrees east and that is going to allow us to plot this so let's just plot those first five cases that's in the data set that doesn't mean it's the first five patients but certainly just in the data set we have geographic geo in the graphics there's a function and what we want is just to plot these markers so the geo marker is the medical data the first five in the geo position column so if we pass that as a geo marker and pass all of that to the geo graphics function we're going to see a nice plot there of this location and we're going to see the the values there plotted for us have a look at the united states so I'm using here the united states medical data sub data set that I created and again we're going to look at all the cases that are available there we pass that to the geo graphics and let's get an idea of where we stand in the united states and certainly all over the united states perhaps on both sides I should say unfortunately of the country that we see that we have cases we can be more precise than that let's first we select any of the cases where the geo position is known so I'm going to say delete missing normal as a list you say medical data select where this value is available so if we do that we have two positions and I've mapped this association with the entity that is the city as I've created here in the last bit of this code on the right hand side so we get all these geo positions and let's just look at New York City so I'm going to remember I have to save that as an entity NYC I'm using and now we can just look at New York City 50 mile radius around this in this code so you see geo range here quantity 50 and now of course there's more cases for each of these red little dots but in a 50 mile radius here this is where these cases were confirmed so certainly a lot of information is listed in the geo data there and you might even set up where you can join these to see over time in what order of time were all these diagnosis made again mortality let's just come back to the mortality I just want to see and the data set as we stand now how many deaths there were and certainly 40,543 that data is not available that does mean the patient did not die but in 46 cases we know that there certainly was mortality in this data set what I might be interested in is that these patients is the true that they had comorbid disease so I'm going to set up a new sub data set comorbid medical data is my data set from that I'm going to select where chronic disease is not missing and death queue is not missing so from there I want to capture chronic disease queue, death queue and the list of chronic disease so only those three columns and there we go it's unfortunately missed out just asking for that data to be there and the reason why it hasn't done that of course is that we didn't add that to the list so certainly you can add this to this list with an end in between that we don't miss that one out we can certainly carry on though I just want the sort of accounts of these comorbid disease let's have a look at that and there we see hypertension in 14 of the deaths there diabetes in 11 hearty disease in 4 bronchitis in 3 Parkinson's and 2 connes stensin 2 so let's just look at the difference in ages between those that have recorded a death and those that have not so I'm asking for the ones to be included with missing queue death that does not mean that they haven't died it's just that that information was not captured and if we save all of the ages like of the 2 data we have 2 groups and as we can see I just want the integer values and I'm saving that under the normal function so that's a list and we can do a box and whisker plot of those and first of all as the patients that are alive and the patients who unfortunately have succumbed to the infection and we can see there's a clear difference in the age between these 2 again I don't think the data is necessarily from a normal distribution just to be on the safe side and if we use the main witness test we can certainly see a statistically significant difference in the age of patients who have succumbed and those who have not or at least have been indicated not to have died and there's a statistically significant difference there meaning that definitely age plays a role and we see some of the countries are now making decisions on the restriction of movements based on age and certainly a plot like this that would support that kind of information so that's very brief some of the data that's inside of this data set there's certainly a lot more clinical information that we can get from the medical data inside of here please look at that data set and try and extract some useful information if you do leave a comment about it I am going to leave some information in the description down below so that you know where to find this notebook that you can use it on your own and certainly remember if you do use it in the future your data is going to look different as this is an evolving pandemic and the data set is being updated