It's Sunday and I'm working at the University of Cape Town campus. We're going to have a cabinet meeting in South Africa today. I'm sure we're going to find out that there will be severe restrictions, and I'm sure the university is going to make plans to close many of the campuses.

What I want to give you here is an introduction to two videos. I'm going to use the same introduction for both, and they are on the analysis of open data that is available for the coronavirus. The Wolfram Data Repository has three data sets available. The data comes from the WHO and is updated very regularly. Just bear in mind that when you do the analysis it's going to look slightly different from mine, because the data is regularly updated. It does lag a few days behind, so you're going to see data up to a couple of days ago.

So we're going to do two videos: one on the medical data that the WHO has made available and one on the epidemiological data. I do have courses on the use of the Wolfram Language for data analysis, specifically for healthcare-related fields. The Wolfram Language lends itself beautifully to data analysis, and the data repository has a lot of open data curated by Wolfram Research that we can use in our analysis. There are a lot of data analysis videos out there, and Wolfram themselves make some, but I want to give you something from the perspective of a doctor, a surgeon working in a hospital where we are preparing for the management of these patients. So have a look at the videos. If you want to learn more about the use of the Wolfram Language, of course I do have courses.
You can get a University of Cape Town certificate through Coursera if you take the course on the use of the Wolfram Language for biostatistics.

So let's go and explore this data set. Some of the data is sparse. I think it's difficult for the world to gather this type of data, even with a worldwide effort to collect it. So some of it is going to be a bit sparse, specifically when it comes to data about, for instance, the symptomatology, the symptoms of the patients. But it is a large data set and we can still learn a lot from it, and that's what all this is about. When I teach healthcare statistics, data science and data analysis, I really urge people to get to the story that the data is trying to tell you. There is information locked in there, that information is knowledge, and that is exactly what we need right now. So let's analyse some of this data. I'll show you how to use the Wolfram Language to do this.

So here we are in Wolfram Desktop. As I mentioned in the other video that I made using the Wolfram Data Repository data on the coronavirus pandemic, you can do this in your browser in the Wolfram Cloud, absolutely free of charge, but I'm using Wolfram Desktop here on my own system. The first place to go to find some useful information on the coronavirus is of course Wolfram Alpha. You can see I've just passed the string coronavirus as an argument to the WolframAlpha function here, and it's being interpreted as a medical diagnosis, and you can get some information there. You can also click on there and interpret it as a species specification, and you're going to get some information about the taxonomy, the size, etc. of this virus. So always remember Wolfram Alpha.
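The query described above can be sketched as a single call. This is a minimal sketch that needs network access; the interpretation Wolfram Alpha picks for the string, and the pods it returns, may differ from what is shown in the video.

```wolfram
(* Free-form query; returns the Wolfram|Alpha result pods in the notebook *)
WolframAlpha["coronavirus"]
```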
What we are interested in is the Wolfram Data Repository. We're going to use the ResourceSearch function and again pass the string coronavirus, to get information on the curated data sets that are available on the coronavirus. This comes from the WHO, I think via Johns Hopkins. The data is available as open data, curated here for the Wolfram Data Repository, so it's easy for us to use inside of the Wolfram Language. The data is updated regularly but it's not completely up to date. The data that I'm working with here today is as up to date as what is available on the servers, but it's going to lag a few days behind the actual date.

We can see the three data sets that are out there. We've already looked at the patient medical data in another video, and I'll certainly put that information in the description down below. Here we're going to look at the epidemic data set. For those of you with an interest in and knowledge of genetic sequencing, there's a fantastic data set on that as well, but in this video we're going to look at the epidemic data.

So let's open this up. We're going to import it with the ResourceData function, as we can see there. Let's increase the size of the screen here, to 150%, and go full screen; that might make it easier for us to read. So ResourceData, and then I'm just passing the name as we can see it here, Epidemic Data for Novel Coronavirus COVID-19, exactly as a string. That is going to import the data to my local system from the Wolfram Data Repository. It might take a second or two, especially way down here in the southern hemisphere where I am accessing this data set from. And there we go, a fantastic data set. We can see some of the columns there, and it goes on and on. Scroll to the right: recovered cases, confirmed cases, geo-position, country and
administrative division, so typical for epidemiological data. We see that some of this data is saved as time series, so a cell in that column actually has a bunch of values in it. You can see the time series here already runs until the 13th of March, with 52 data points. When you run the code you're going to see a different number of data points.

As always when you start analyzing data, let's look at the statistical variables, the column names, the column headers. I'm going to use the Normal function. My data set is called epiData, and it's the keys of the first row that I'm interested in. We see the statistical variables: the administrative division, country, geo-position, confirmed cases, recovered cases and deaths.

As always with a data set that comes from the data repository, you might be interested in some more information on it. We can use the ResourceObject function, pass the name of the data set, and I'm going to store this in a variable called epiDataResource. That resource has a lot of properties in it, and whenever something has these properties I like to use the Manipulate function. I pass the list of properties as something that I can select from inside of Manipulate, using a temporary variable, so I can drop down there and just click on any of them, and it'll give me the information from that property instead of my having to type in all of the different properties every time. That's something I like to do.

As with any epidemiological data, let's start by looking at a couple of countries, and let's have a look at how there's been this increase in cases, as any pandemic will develop. When we talked about the medical data I did mention this, and I'll mention it here as well: even though, if we scroll up, we see these strings here, China, Sweden, that is not how it's saved inside of the data set. It is saved as entities, and if you just look for the string value there you're not
going to find that. So we have to work with an entity. There are various ways to get an entity, such as free-form input or the Entity function, but I'm using the Interpreter function here. I'm saying: please interpret the string Italy as a country. If I run that and store it in a variable such as italy, this entity here is what is saved inside of the data set, so that is what I have to search for.

So let's create a new data set that only has data on Italy inside of it. How would I do that? There are a couple of ways. What I'm going to do here is use notation straight from a data set. I start with the data set name, then I'm going to use Select, and I'm going to say MatchQ. For the medical data I actually used different syntax to do exactly the same thing, so there are quite a few ways to go about it, and here's another one. So I'm saying MatchQ, so a row is only going to be selected if it does indeed match. And what do I want to match? I go down the country column. The little ampersand there makes this a pure function that is applied row by row, and that's why we use the hash symbol there, so #Country. The country we're looking for is this entity value, italy, so it's only going to include data from Italy.

Let's run that, and we see there's only one row dedicated to Italy. For some countries you'll see that it goes by province or territory or state, and we get quite a bit of information, but for Italy it's all captured in one row. We can see the time series here, and the little graphs, so we can see this rapid growth in the number of cases there.

So let's have a look at the confirmed cases. I want to extract that, so I'm going to say italyEpi, all the rows. See the double square brackets here: I'm using indexing, row then column. So as far as the rows are concerned, give me all the values, and then the confirmed cases column. There's only one row, so it's only going to
give me this row, this column, so it gives me back this time series.

Now, the time series data comes from all over the world and is being captured by different people, and let me be brutally honest: there's a bit of work to be done when you want to extract this data. First of all, let's just do a DateListPlot on this data. We're going to say DateListPlot of italyConfirmedCasesTS. I'm putting that inside of curly braces, as you can see here, to make it a list. The image size is large and I've just added some plot labels. And this is the unfortunate picture that we see here, starting on the 27th of January and running until after the 9th of March. We can see the scale here, the number of cases. Very, very unfortunate. We can see this rapid rise in the number of cases there.

Let's look at the recovered cases. Let's just extract that time series. I'm just using indexing there: all the rows, and there's just a single row, of recovered cases. Now we can plot them both. Let's plot them with the Show function, so we can have two DateListPlots in exactly the same plot. We can see the lag in the recovered cases compared to the confirmed cases. We can also see that the date of this information lags a bit behind, but certainly there is a good rise here in the number of recovered cases being documented.

Now I want to extract the numerical values only. DateListPlot has done a wonderful job of extracting the actual values and the dates so that we can have these beautiful lines, but doing it by hand takes a bit of extraction. Unfortunately there's also a bit of discrepancy between the time series: you have to extract the values in a slightly different way for each, and you'll have to play around a little bit. For this instance I know what works, I've looked at it before. I'm going to use the Apply function, and I want to apply List to the italyConfirmedCasesTS time
series. Now if we look here, this is what I've saved it as, italyRecoveredCasesTS, and I just went for that cell in that column, so I've got this time series object there. And this is how I'm going to extract it; I know that that works well. Now this is what I get: a time series with 52 points, and if I click on that I see that inside of it I have the actual numbers, which is what I'm after. If I can spell this correctly, that N should certainly not be there. The other way to write the Apply function is of course the shorthand notation, so List @@ italyConfirmedCasesTS, which was my variable name there. Let's just correct that now, because I'm going to put some information in the description down below so that you can download the notebooks for yourself.

So I have it in this format, but it's still not what I'm after. I'm after this little bit here. So all I'm going to do is ask for this again, just by shorthand, and pass that to the Normal function, and let's see what I get from that. Now I get a nested list where I have a date object and the number of cases. I'm certainly getting closer to what I want. And that's what you have to do with this data: just play around with it a little bit until you get back exactly what you want. This is great for me, because I'm going to pass the values to the TimeSeriesModelFit function. There we go. What we are going to attempt to do here is to create a model. Now, there are certainly not enough numbers here. It's too early in the pandemic to use very simple time series model fits to predict the future, but there's enough data here for us to attempt this. I'm going to keep it simple and use just a couple of functions, to give you an idea of the power that's locked up here and that you could unleash. So I'm returning exactly this list. I could have saved it in a computer
variable but I haven't, so there it still is, the Normal and all of that. What I'm extracting is only the numbers, so it's the second element of each of the nested pairs. I'm just passing all of those sequential numbers: 0, 0, 0, 0, you can see it started as zero cases in Italy on those days, and then it started to ramp up. So I'm only passing that specific sequence of values to TimeSeriesModelFit.

This is time series analysis, a very complicated topic, and what I'm not going to do here is pass a specific type of model. So I'm not going to say autoregressive, or moving average, or autoregressive moving average, or the integrated version of that, or SARIMA, which builds seasonality into that. If you know anything about time series analysis you can certainly specify here what model you want. I'm just going to let the Wolfram Language decide on its own what to do with the values that I give it. You can see it's already giving me back a time series model with a bit of information on it, and it's chosen seasonal ARIMA for us as the method for constructing this model. It has a bunch of properties. Again, remember you can use the Manipulate function to go over all of these, but those are all the properties that are available in my model. I can get more information on the model by just passing the model to the Normal function, and you can see it's the SARIMAProcess function inside of the Wolfram Language, with some information on that there.

What I want to do now is use the TimeSeriesForecast function. There we go. I'm passing my model, which I've saved as tsmItaly, and I want the next 10 days. Can you forecast for me the number of cases that we're going to see in the 10 days after the date at which this data stops? I'm going to pass that to a ListLinePlot. So first of all I'm going to plot the data that I do have, and that is this
property, TemporalData, of my model, so that's the actual values that I do have. Then I'm going to use TimeSeriesForecast to plot a prediction of what we might see in Italy and where we stand in the epidemic. And with the SARIMA model, unfortunately, that's what it looks like. Hopefully that's not how it's going to pan out, but this was all just set by default, and certainly if you are familiar with time series analysis you can do a much better job at this. I just wanted to show you how easy it is to plot the data that we do have and then the forecast that we can make from a model. Certainly as far as this very simplified version is concerned, the number of cases does not look good at all.

Now, in contrast to that, let's go to a place where we have already seen that the number of cases has started to plateau and even come down. If we go right back up to the data we'll see that the first row there is Hubei. So I'm going to call it the Hubei confirmed cases time series, and I'm only extracting that time series from row one. There we go, there's indeed a time series, and you can see the little plateau there. To extract the values from that time series I had to do something different. Instead of what we did for Italy, this is what worked for me: again applying the List function to those values, and I had to use this indexing to extract them, and then I passed all of that to the Flatten function, so that I have the number of cases as we had before.

This time I'm going to pass all of these numbers to TimeSeriesModelFit again, and this time I'm specifying the method that I want for the time series analysis, and that's going to give me my model. Remember I can extract all of the properties from the model, but what I want to do again is a ListLinePlot of the actual values that we have and then the forecast of what is going to happen in the next 10
days. We can certainly see the difference here in Hubei province in China as far as the forecast is concerned. So please do a much better job than I've done here. I just want to show you how easy it is to use the Wolfram Language to do time series analysis.

I couldn't help myself, I just wanted to look at the United Kingdom as well, because of family that I have there. Remember I have to find an entity value for the United Kingdom. I'm going to use the Select function on epiData again, so that I have a data set that only has the United Kingdom inside of it. The one that contains the real data here is this first row, so that is what we are interested in. Let's look at the time series of this first row. I'm just using normal indexing there, row 1 and then the confirmed cases, and I'm saving that as a variable. Let's have a look at that plot, at where we stand as far as the UK is concerned. Unfortunately it looks the same as in other places, in that we see this jump. Now I'm going to do the extraction all in one go, again a little bit different from the way we did it for Italy and for Hubei province. I'm also going to keep it very, very simplified and just use SARIMA. There I have a model, and we can again look at what the forecast looks like for the 10 days after the values that are available here. So certainly, again, it predicts this increase.

One more: let's look at Australia. Entity, there we go, Australia. I'm just going to have Australia as the selection criterion there, and we can see that it is broken down into all these different states and territories: New South Wales, Victoria, South Australia, Western Australia, etc. I just want to show you how easy it would be for us to have a look at the time series of all of them. So have a look at the line of code there. What I wanted to show you is that if you click on any of these we're going to
see that time series for that row. This is how I extracted it in this instance. We do a DateListPlot on all of those combined. I'm saying up to number 7 because, if we go and look here, I don't want this last, empty row to be included. There we can see the numbers for the different states and territories, and we can see the Northern Territory there at the bottom with the lowest number of confirmed cases.

One more thing we might have a look at with this epidemiological data is of course the different countries. What we've done here is to start with epiData: all the data in the country column, please return that for me as a list, so I've passed that to the Normal function here. Delete all the missing values for me, delete the duplicates, and then sort it. If we do that we can see this alphabetical sort of all the countries that have been affected. Just look at that sorry list. Unfortunately very few places have been spared from this epidemic, this pandemic I should say. Let's plot this with GeoGraphics. We can certainly see here that Antarctica seems to have been spared, but everywhere else looks pretty much affected by this, unfortunately.

I'm going to group by country. I'm going to use epiData, so that's normal data set notation, epiData and then the GroupBy function. So group by country, and I want the totals, and I want the confirmed cases, and from the confirmed cases I just want the last value, because confirmed cases is going to be a time series but I just want the last value inside of that, which will give us the latest view as far as this data set is concerned. So there we go: China, Italy, Iran, South Korea, Spain, all the unfortunate things we've seen in the news. Let's go for a region plot here, GeoRegionValuePlot, and if we set the plot range to about 5000 it's going to give us good differentiation in a map of how the
different countries have been affected. You see the legend here on the left-hand side, so yellow would be for fewer than 1000 cases, and then it goes up from there. We do know that China was the first epicenter of course, then unfortunately Iran there, and then of course Europe, which has become the new epicenter for the continuation of this pandemic.

Let's just have a look at the United States, if we want to use this epidemic data. There's the entity, and I'm only going to select cases from the USA. We've seen how to do that now with the Select function. Let's do a GeoRegionValuePlot of that. I am going to group by the administrative division, which is going to give us the different states, I think. Washington, New York, California, Massachusetts, there we see the different states. I want the last value again of the confirmed cases. And the plot range: we're dealing with a lot fewer patients here compared to the whole world, so I'm going to set the plot range to about 100, so we can get some idea of how the different states in the United States have been affected. You can see Texas there, you can see Miami, you can see California with the highest number of cases, unfortunately.

Let's do the top countries. That is always a plot that we might want to include when we talk about something like epidemiological data. I'm going to sort epiData. I'm going to group by country again, I'm going to take the last value, and we're only going to take the first 11. I'm using indexing there, saying give me the first 11, because I am doing that group by and I only want the first 11 back, otherwise it becomes too much to plot. So there we go, the top countries. When I ran this data set before there was certainly a cruise ship or two included in this, and I see that in the latest data that has been removed. If it still shows up for you, what you can do is use the KeyDrop function. With KeyDrop you can specify then what
you want, what key you want dropped, and for me that is something that is not a country. Cruise ship was one of the countries that did appear in this list, so you can certainly remove that if you want to. Normal is of course going to give us the entities themselves, and then we see how many cases there are. Again, we could drop that from the association if the cruise ship still appeared here, because we're just interested in the countries. Now we can just do a normal bar chart with this association. I'm setting the plot theme to Business here. It's a well-drawn plot, and we can see China here at the bottom, still with the largest number of cases, and then we can see Italy, etc.

Just a little bit of code here if you want to extract the values separately. There are the countries, and I'm converting them to names so that we can add them to the plot. We can extract the values from the association that we have there. So now we have the values and the names, and we can use a different type of bar chart, one that has a legend on the right-hand side showing the different countries and colors. So those are two ways to set that up, in case you wanted a different look for your plot. There's some useful information for you there on how to extract that.

So all in all it's pretty grim from a medical point of view, from a surgeon's point of view, as we look in my own country at how to deal with this pandemic and what it might mean for our health services. It's great to have this information available. The Wolfram Language makes it very easy for us to extract, and there's certainly a ton more information that I haven't touched on here. If you have some questions, I'm quite happy to help you through this data set and to extract some of this data. Just leave some comments down below; you know how to contact me. This is also done for my students on Coursera, where I have three
courses. One is on medical statistics, an introduction to medical statistics, which is the leading course on medical statistics as far as massive open online courses are concerned. I also have a course specifically on biostatistics using the Wolfram Language. For those students, I'm going to leave a message on Coursera for you to watch these videos, and I hope that you find them useful. As I said, there's a ton more information in here, some of it information that every one of us can use locally, and certainly some of this data and the results thereof are things that we will look at locally in the hospitals and the academic complex of the University of Cape Town, where I am a surgeon.
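For readers following along, the Italy steps narrated in this video can be sketched as below. This is a minimal sketch under stated assumptions, not the notebook used in the recording: the variable names are a rendering of the spoken names, the column keys "Country" and "ConfirmedCases" are assumed from the column headers read out in the video, and the numeric values are pulled with the "Values" property of the time series rather than the Apply and Normal route described above. It needs network access, and the results will differ as the data set is updated.

```wolfram
(* Import the curated data set from the Wolfram Data Repository (needs network access) *)
epiData = ResourceData["Epidemic Data for Novel Coronavirus COVID-19"];

(* Column names, taken from the keys of the first row *)
Keys[First[Normal[epiData]]]

(* Countries are stored as entities, so interpret the string first *)
italy = Interpreter["Country"]["Italy"];

(* Keep only the rows whose Country matches the entity; key name is an assumption *)
italyEpi = epiData[Select[MatchQ[#Country, italy] &]];

(* One row for Italy: take the TimeSeries out of the ConfirmedCases column *)
italyConfirmedCasesTS = First[Normal[italyEpi[All, "ConfirmedCases"]]];
DateListPlot[{italyConfirmedCasesTS}, ImageSize -> Large,
  PlotLabel -> "Italy: confirmed cases"]

(* Numeric values only; QuantityMagnitude leaves plain numbers unchanged *)
values = QuantityMagnitude[italyConfirmedCasesTS["Values"]];

(* Let the system choose the model family, then forecast 10 steps ahead *)
tsmItaly = TimeSeriesModelFit[values];
ListLinePlot[{tsmItaly["TemporalData"], TimeSeriesForecast[tsmItaly, {10}]},
  PlotLegends -> {"observed", "forecast"}]
```

As the video stresses, a default fit on so few points is only an illustration of the workflow; the forecast should not be read as a prediction.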