In this very short video on using R for data analysis, we're going to use the Johns Hopkins coronavirus data set. I'm just going to show you how to extract information from it and how to do some simple visualizations. We're going to use Plotly for that and just draw some conclusions. Now, I'm going to warn you: it's based on South African data, and I'm just going to use South Africa as an example. I'm also going to compare it to the United Kingdom and to Germany, just to draw some conclusions for countries like South Africa, which is a month or two behind as far as the case numbers are concerned, and how that might translate into what's going to happen over the next two months. Now, nothing ages like a YouTube video, and I'm going to say that again during the screencast. Two months from now we'll know exactly what happened, so prediction is not the aim here. It's just a video on how to extract some data. There are lots of ways to extract data from the Johns Hopkins data set, and I'm going to show you just one. We're going to extract some countries, build a new long-form data frame, transform our data into it, and do some simple analysis, descriptive statistics at least, and some data visualization.

So the data that we're going to use is from Johns Hopkins University, and they keep a repository. The repository is available on GitHub and it gets updated daily. Now, we've got to be circumspect about this data. Of course it's not very accurate. I think it's the best job that is being done all over the world to collect this data, but if we just think locally, in most places there will be a lot of people who do not come for testing when they are ill, and they are of course never registered in the data.
There are some seroprevalence studies right here in South Africa, and some of them suggest a much higher seropositive rate than the rate of actual positive cases from swabs, in some instances a seroprevalence rate after the first wave of up to a quarter, and it might be even more. And of course, while we are today very much on an upward climb as far as the second wave is concerned, the seroprevalence rate is just going to climb. But this is the best data that we have, and what we're trying to achieve is not to look at the absolute values for any one specific country. We're looking at trends and we're looking at comparisons.

So you can see the aims of this notebook here. The first is just to show you how to use the R language to deal with the data that you can get from Johns Hopkins, at least on their GitHub repository. This is going to be an introduction; we're not going to get into it deeply, but there are some vagaries to that data set, and you have to know how to wrangle the data to get the answers that you're looking for. The second aim is to use it as a teaching resource for data analysis, because we are going to do some data analysis in this notebook, and we're going to use that as teaching material on how to do and think about data analysis. I hinted at that in saying we're not looking at the absolute values. We want to see some interesting plots and we want to draw some conclusions from them. That's data science.
I'm also going to compare countries. This is inherently meant for South Africa, and I'm going to add Germany and the United Kingdom as my two countries of comparison, because those are two first world countries that I want to compare South Africa to, which it most definitely is not. Those countries, bar a couple of million people, are roughly in the same ballpark when it comes to the size of the population, and of course we want to express our data per hundred thousand people, or some such ratio, so that we can compare different countries to each other. If you want to see the comparison with other countries and you use R, certainly you can use this notebook and just extract some of that data.

What I want to hop back to is the fact that this is data for South Africa, and what lies ahead for us. Now, nothing ages as quickly as a YouTube video, and if I have the time and ability, certainly I'll make a follow-up to this, because by the time you watch this you will know what happened. We sit right here at the end of December, you saw the date there, and we're thinking forward to what's going to happen. Instead of using, in this notebook at least, recurrent neural networks as far as deep learning is concerned, or some other way to predict the future, some differential equations etc., we're just going to use what has happened to other countries that are ahead of us in the timeline to predict, or at least look forward to, what's going to happen to us.

The libraries that we're going to import into R, and remember, with these libraries we extend the functions that are available to the language, are the readr library, the plotly library and the DT library. readr allows me to import a spreadsheet file that contains the data from Johns Hopkins, and it's
going to store it as a table, which is easy to manipulate. plotly is going to be my library for data visualization, because it's nice and interactive, and DT is just to create nice tables in HTML, which is for display in a browser. This document that you see now comes from the R Markdown file, which I knitted into an HTML file, and that's the result that you see here.

So, the data. You can see this is the URL for the Johns Hopkins GitHub repository, at least as far as where the data is concerned, and you can see here the data is in CSV format, comma-separated values. If you don't know what that is, remember that it's just a very generic way of storing spreadsheet data. So it is spreadsheet data, but it is in what we call wide format, and we'll see there are problems with dealing with wide-format data. I've imported the data using, from the readr library that we imported, the read_csv function, and I pass the URL as a string, a string because it's inside quotation marks. I'm assigning that to a computer variable that I've called confirmed raw, so that's just the raw data as it comes in. Remember, the computer variable names that you use are up to you. Just make them descriptive, so that when you see them again down the line you know what you were trying to do.

The names function is going to give me the column headers, the names in the first row of the data, so the headers of each of the columns. We all know what a spreadsheet looks like: down each column is the same sort of data. What we can see here is that the first column is province or state, the second one country or region, then latitude, longitude, and then every day from the 22nd of January becomes its own column. That's why this is wide data instead of long data. In long data we would rather have a column for dates, and down the rows go all the dates, but here the dates are
along the columns, and so that's wide data, and we'll have to do something about that. You can see in this data set, every day it's going to be different, but it goes until yesterday, which was the 27th. So every time you run this you're going to get a new date. Your final value is going to be different and you're going to have more columns in your data set, and we'll have to address that fact.

The way that I'm going to do that is with the ncol function. It's going to tell us how many columns there are in the data set. As I said, every day you do this there's going to be a different number, and I'm assigning this to a computer variable that I'm calling last.day.number, so that I know the column number of the last column in my data set. Then I'm going to extract from that the actual date. You can see there this is 12/27/20, so that's month/day/year, which is not the format that we want for analysis. So what I'm going to do is extract that very last date in the top row, in other words my column headers, and assign it to the computer variable last.date, and I'm using the as.Date function. I'm taking names of confirmed raw, and I'm using index notation here, so in that first row, which is my column headers, I take the final column, ncol of confirmed raw. I could have just said last.day.number there, because I've assigned that to last.day.number. I want to extract that as a date, so I'm passing all of that as arguments to as.Date, and I'm telling R that at the moment it's in month/day/year, and it's a lowercase y because it's not 2020, it just states the 20, so it's not an uppercase Y. That says: this is a string, and please extract it as a date, where the string is in this little format here. Once we've done that we have our last date as a date.

So let's extract data just from South Africa, and I'm going to assign that
to the computer variable sa. We know the Country/Region column is where all the country names are listed, so I'm going to use index notation again. Take the confirmed raw data frame, and then confirmed raw, dollar, Country/Region, and because that forward slash is not a standard character I've got to put the name in these little backticks. I can't just use the name after the dollar symbol. Then equals equals, so this is some Boolean logic here: if it is South Africa, please include that in this new variable. So sa is just extracting the row that contains the data for South Africa, and remember, comma, all the columns. So that specific row that has South Africa in it, and all the columns, because this is wide data.

Now, remember the first column was state, the second one country or region, and then latitude and longitude, so it's only from column five onwards, which is 22 January onwards, that we have the actual confirmed case numbers. I'm going to assign that to a variable that I'm going to call sa.cases, and I'm going to use the as.numeric function to extract that as a vector, so I'm just going to have the list of cases day by day. I'm saying: from the sa data frame that I have now, take row one, there's just this one row of data, and columns five, colon, last.day.number, so that's from five right to the end. That's why I saved last.day.number; remember, that was how many columns I have, 200 or 400 or whatever the case might be. So now I'm going to have a vector of just the case values.

Similarly, I want another vector that just contains dates. I'm going to call it dates, and I'm going to assign to that a sequence, using the seq function with as.Date: start on the first day, end on the last day, and remember I saved that
as the computer variable last.date, and do that day by day. So I'm going to have 22 January, 23 January, 24 January, right up to 27 December. Now, with these two vectors, I'm going to create a new data frame, which is now going to be in long form. My first column is going to be the dates and my second column is going to be the actual values. I'm going to assign that data frame to the computer variable df, and then I'm going to use the datatable function and print df to the screen. You can now see we have this in long form. On the 22nd of January we had no cases, 23rd of January no cases, and you can just flick through this table, and there we start seeing our first case on the 5th of March. But it's now in long form, so I only have two columns: the day column to give me the date, and an RSA cases column. When I created this data frame, right there, that's how I got the column headers. I'm saying make a column header called day and pass it the vector of dates that we created up here, that sequence of dates, and create another column called RSA cases and pass to that the list of case numbers that we extracted using indexing and the as.numeric function.

So now I have a data set that I can really work with, and that's the crux of the matter: how to get that wide data into this long-format data. This is just one way to do it. That's the beauty of the R language: there are better ways to do this, easier ways, longer and more complex ways. There are many ways to do this, and this is one way which I think is particularly easy to understand. I think YouTube is going to put a little ad right now, so if you want to take a break, now is the time, and I'll see you after the break.

So let's visualize this data from South Africa. We've got long format now, and I'm going to use plotly, so plot_ly is the function. On my x
axis I want, from my data frame, the day, and on my y I want the RSA cases. The mode that I want is markers, so just little dots, the type that I want is a scatter plot, and the name I'm going to give it is RSA. I'm using the pipe operator there to pipe all of this to the layout function, and in the layout function I want title, xaxis and yaxis. The title I pass as a string. For the x axis you can do a lot of stuff, so you've got to pass those parameters or arguments as a list. I just want to change the title there, so I'm giving the titles date and number of cases.

When we draw this to the screen, we can see how the cases started here, nothing on the 22nd of January. I like this interactive plot, because every place I hover we can see what happened. So there was our first wave, it came down for a little bit, and we went into the second wave. The last dot there is the 27th of December, where we had reached a million registered cases at least. And of course this is all interactive, so I can just draw a little square over there and it's going to zoom in, and I can also move it around to have a look at different bits. I can really zoom in, and of course I can go back, hit the home button, reset the axes, and we're back. So it's very nice just to zone in on the data that you are looking for, and you can of course export this as a PNG file in case you want it in a report.

So that's it, but these are the cumulative cases. What is the case number per day? What I'm going to do is take my data frame and add a new column that I'm going to call RSA daily. I'm just going to take every day's cumulative case count and subtract the previous day's from it using the lag function. My lag is going to be set to one day, so it's just going to look at the cumulative values, the difference between each
day, and that difference will of course be the number of new cases from one day to the next. That's very easy to do with the dplyr package, which has a lag function. So now we're going to plot every day's new cases, and we can clearly see the two waves, and already the second wave is outsprinting the peak of the first wave, which is about 13,900 cases, and here we have 14,700 new cases registered in a day. So you can clearly see what's happening there.

That's interesting enough, but let's take a country of sort of similar population and have a look at what happened to them, because we know we're trailing them by a couple of months. So let's add Germany and the United Kingdom. We're going to go through these exact same steps. I'm going to say, from confirmed raw, if the country is Germany, let that be included. I'm going to extract the values again as a numeric vector, and I'm going to create new columns in my data frame, and I'm going to call them Germany cases and Germany daily. For the daily, remember, I'm just going to use the lag function again. So the exact same information we had for South Africa we're going to get for Germany as well.

And there we can clearly see this lag. The first wave started for Germany round about here in March, and we started this climb only here in May, so lagging a couple of months behind. We can clearly see the second wave way outsprinting the first wave as far as Germany is concerned, and with the couple of months we are behind, I've got to wonder where we are going to go. The concerning thing here, of course, is if you look at when the sort of steady state was reached, this plateau here, the South African plateau, even with a smaller population, was much higher. So where are we going to be in two months' time? That's the big concern. Before I say something about that, I think this
plot, at least for us, for everyone, is quite concerning, of course for Germany but for South Africa as well. What we do now is just plot the different numbers, and as you can see there's quite a scattering there, and there are reasons for that. Representing the data here, we have Germany, and then South Africa again, clearly a couple of months later. You can also see that when we went through this phase after the first wave, there was still quite a high number of cases per day, as opposed to Germany, which went down quite far, to 422, while we stayed much higher, still in the 2,000s. So that's very concerning.

There's about a 20 million people difference in the populations. I said sort of in the same ballpark, but it's not really. The other problem that we have with this data is that not only is it inaccurate because not all cases come for testing, especially, as we know here in South Africa, seeing that the seroprevalence rate is so much higher than the confirmed rate, but we also don't have accurate numbers as far as the population is concerned. We do know that we have a lot of illegal immigration in South Africa. Whether a subset of those people had gone back to their countries across our northern borders, we're not 100% sure. This is also population data from 2018, so it's not accurate. But as I say, the idea here is to compare and to look at changes over time, so for that we don't have to be terribly accurate with our numbers.

So what I'm going to do is set the population, and you can see the German population there, and we're just dividing that by 100,000 so that we can express the number of cases per 100,000 people, and that's going to put us on more of an equal footing. Then I'm just adding new columns here with the same sort of scenarios. I'm taking the total cumulative cases and the daily cases and dividing them by this per-100,000-people value that I have. And now you can see that South Africa, at least, is still
in trouble, inasmuch as, when we had this period after the first wave, we were much, much higher, even normalizing for the size of the populations. And of course we can see us sort of catching up, inasmuch as the time lag between our first wave and second wave was a lot shorter. We can see a similar thing here for the daily cases.

So, as I mentioned before, let's add the United Kingdom to this, which has an even higher rate than Germany. Again you can scrutinize this data. Remember, it's always easy just to zoom in on this data, and now that we've zoomed in we can also pan around, left and right, or we can just go back, reset the axes, and we're back to our chart. So it's nice and interactive. And again I think, as pertaining to those two countries, you can see what lies in store for us. Once again, here for the daily cases, you can see for the United Kingdom a first wave, a second wave, a kind of dip, and almost a third wave now, and we can think exactly the same thing is going to happen to us.

Now, just yesterday, as far as the timing of this recording is concerned, we went back to level three restrictions. We had level five, which is total lockdown, then four, three, two, one. As far as level three is concerned, these things are not that precise, because what level three was before and what level three is now is a bit different, but at least yesterday there were some changes again, some new restrictions brought in to try and get this under control. Now, how much of this is due to a virus that is more contagious? We all see that in the reports, and we know in vitro tests suggest that there is better binding to receptors, but that's in vitro, and that does not always translate to what happens in the real world. I don't think there are particular studies that have really confirmed that this is more contagious, because suddenly we've had gross changes in the way people
interact. We've all noticed that people have gone on holiday, they gather, they don't adhere to social distancing, they don't wear masks properly. We've seen that change in attitude, and surely the changes that we see here cannot all be put down to mutations of the virus. We know there are over a hundred thousand mutations, I think, now identified in the virus.

So where is this all going to end? Certainly, with the changes that were brought in yesterday, and from this data at least, it seems like we're not in for as rosy a January and February as we would hope. Of course the economy suffers from all of this, the healthcare system suffers from this, and we all fear a time when people will come to the hospital and there is no way for them to be helped. I think the reality might strike for people then. Behavior certainly has to be kept in check. On the other hand, the economy has to be kept running.

In South Africa, of course, there's going to be a gross delay in vaccination. We struggled to pay just the deposit to be part of COVAX, and COVAX is never going to provide us, I think, with enough vials of vaccine to cover the whole population, so it's going to be a subset of the population. By the time it becomes widely available in South Africa, one has to wonder what the seroprevalence rate is going to be. How close are we going to get to herd immunity before, probably, the end of 2021, when we see mass vaccinations in South Africa? And of course this is a novel virus. We don't know where it goes. I think the prevailing thought is that the mutations would not lead to a required change in the vaccine, but it is a novel virus, so we can't speak with any kind of surety as far as that is concerned. What I think, from this data at least, unless something major happens, is that over the next two months the case numbers might not change in the way that we would hope, but that they're going to rise, and they're going to rise considerably
as people have now been on vacation, met with their friends and family, and are going to have New Year celebrations, and we'll have to see what January and February at least have in store for us.

I hope you found this video informative. If so, remember, please subscribe to this channel and leave some comments down below. It's nice to interact with the community of people who watch these videos.
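
The wide-to-long extraction walked through in this video can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the notebook's actual code: it uses base R only, it runs on a tiny made-up table shaped like the Johns Hopkins file rather than the real download, and the names `confirmed_raw`, `wide_to_long`, `Day`, `Cases`, `Daily`, `CasesPer100k` and `toy_population` are all hypothetical. Where the video uses the lag function from dplyr, the sketch swaps in base R's `diff`, which gives the same day-to-day differences.

```r
# Toy stand-in for the wide Johns Hopkins table. All values are made up
# for illustration; the real file has one column per day since 1/22/20.
# With readr installed, the real data would instead come from something like
#   confirmed_raw <- readr::read_csv("<URL of the confirmed-cases CSV>")
confirmed_raw <- data.frame(
  `Province/State` = c(NA, NA),
  `Country/Region` = c("South Africa", "Germany"),
  Lat  = c(-30.6, 51.2),
  Long = c(22.9, 10.5),
  `1/22/20` = c(0, 0),
  `1/23/20` = c(0, 1),
  `1/24/20` = c(2, 4),
  `1/25/20` = c(5, 4),
  check.names = FALSE,       # keep the slashes in the column names
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull one country out of the wide table and
# return a long-form data frame with one row per day.
wide_to_long <- function(wide, country) {
  # Number of columns changes daily, so look it up instead of hard-coding it
  last_day_number <- ncol(wide)

  # Column headers are month/day/two-digit-year strings, hence lowercase %y
  first_date <- as.Date(names(wide)[5], format = "%m/%d/%y")
  last_date  <- as.Date(names(wide)[last_day_number], format = "%m/%d/%y")

  # Backticks are needed because of the forward slash in the column name
  country_row <- wide[which(wide$`Country/Region` == country), ]
  # Assumption: exactly one row per country (true for this toy table)
  stopifnot(nrow(country_row) == 1)

  # Columns 5 onwards hold the cumulative case counts
  cases <- as.numeric(unlist(country_row[1, 5:last_day_number], use.names = FALSE))
  days  <- seq(first_date, last_date, by = "day")

  data.frame(
    Day   = days,
    Cases = cases,
    Daily = c(NA, diff(cases))  # same result as cases - dplyr::lag(cases)
  )
}

sa_long <- wide_to_long(confirmed_raw, "South Africa")

# Placeholder population, NOT a real figure; divide by 100,000 to get a rate
toy_population <- 2e6
sa_long$CasesPer100k <- sa_long$Cases / (toy_population / 1e5)

print(sa_long)
```

With plotly installed, a call along the lines of `plot_ly(sa_long, x = ~Day, y = ~Daily, type = "scatter", mode = "markers")` piped into `layout(...)` would then draw the daily cases in the way described above. Calling the helper again with "Germany" gives the second country for comparison.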