Okay, it's a wonderful Saturday here in Cape Town, almost summer, lovely weather out, here on the University of Cape Town campus. For quite some time I've wanted to put together this video series on using R for biostatistics. Now, in my research unit, R is not our language of choice. We use the Wolfram Language most of the time, and we also use Python quite a bit, but R is something that we use from time to time. We have many students come to us who are really working in R, many of them international students. R is a language for statistical analysis, and it is fantastic for biostatistics. Don't get me wrong: I think it's probably one of the best, and we do use it under these circumstances, as I mentioned.

This first video is actually going to be quite long. I want to give you a whole introduction to R in one go. That makes it a bit difficult, because I can't tell you everything in one video, and it does get long. But I really want you to get an understanding of what R can do for you as far as biostatistics is concerned, if you want to use it. So we're going to run through the installation of R, how to create data, how to create simulated data, how to do descriptive statistics, how to visualize your data and then how to analyze your data. But even before all of that, I'll show you how to code in R, so you get some feeling for what this functional type of language is all about.

So really, R is a fantastic language for statistical analysis and for biostatistics, and I hope you enjoy this very long first introductory video. I've already made a few others, some on visualizations, some on logistic regression. Please find them. The data files that I'm going to use are going to be available on GitHub, so you can just download them.
I'm going to tell you all about RPubs, where you can publish your documents as HTML documents. R in RStudio is really a fantastic, powerful tool, so I really want you to enjoy this. Let me know in the comments if there's anything that you want more videos about. As I mentioned, we use R, and you can do anything in R as far as biostatistics is concerned. So if you are interested, just let me know in the comments and I'll make more videos. And especially for all the international medical personnel who visit us in our unit and who want to use R: no problem whatsoever.

So the first thing that I want to talk to you about is the installation of R. As we mentioned, R is a programming language built specifically for statistical analysis. The way to get R if you are on Windows or macOS is to open your favorite browser and type in R, space, CRAN. CRAN is the Comprehensive R Archive Network. If you type that in, the first link that you're going to see is probably cran.r-project.org. And if you go there, this is what you're going to see: a very austere, old-fashioned kind of website. But everything is there for you to download, and you can see Linux, macOS and Windows. If you click on one of those, you're going to get a file, which you download and install as you would any other program.

Now, you can get the files to install R for Linux here. But if you're on a Linux system, I would really suggest that you use the package manager that comes with your distribution. There is going to be a type of app store for your distribution, and R would usually be there. It's usually called r-base or R-core; the distributions all give it a slightly different name. But if you search for that, you are bound to find it, and it's much easier to install it through your distribution than it is to download a package and install it from here.
Unless you know what you're doing and you are familiar with installations on Linux. Once you've installed that, it is just going to give you a user interface that really only allows for typing at a terminal. What we really want is a graphical user interface. So after installing R, I want you to search for RStudio. You can type RStudio, or R, space, Studio. The first one that's going to come up is probably rstudio.com. Let's have a look at that. It's a development environment that will see where your installation of R is on your system and use that, but in a nice graphical user interface, and I'm going to show that to you shortly.

Here you can learn a lot about RStudio. There are even paid versions here, but we're not interested in those. We're just going to download RStudio there. That's the free version, and you're going to install that. You can read about some of the packages that the RStudio company, if I can call them that, develops. You see R Markdown, Shiny, tidyr, knitr, ggplot2. They're all available to you free of charge.

The way that R works is that there is this base language itself, and that is what you downloaded from CRAN. But it is extendable. In other words, people can write extensions to the language that make your life so much easier. Shiny, for instance, is something that you can use to develop programs. tidyr is very nice for wrangling your data: if you bring data in, you can really clean it up nicely. ggplot2 is a fantastic plotting library. Now, I'm going to show you that you can plot just with base R, and you can create very nice plots and graphs. But ggplot2 can give you even nicer-looking graphs. And there are really thousands of these packages. They're very easy to install and then very easy to use in R. They just extend the abilities that R has to help you with your statistical analysis.
It's just so much bigger with these thousands of packages that you then add to R. Another beautiful thing about R is the RPubs website from RStudio, rpubs.com. If you are here on RPubs and you refresh your web browser every now and again, you'll see new files coming up all the time. If you're in RStudio and you create a file, which I'm going to show you, you can upload it to this cloud website and everyone can see it. So it's very nice if you want to collaborate with people. People can see your work, and you can also look at other people's work here. That's a brilliant place to learn a few things. For instance, a lot of the work done in my unit we put out here on RPubs.

This video is all about introducing R for biostatistics, and the file has already been created. It lives right here on RPubs. So if you go to Juan H Klopper, that is rpubs.com, forward slash, juanhklopper, and then slash, intro to R for biostats, all those words with underscores between them. But just go to Juan H Klopper and you'll see all our files there. And this is the file that we are going to create. Let me hide the toolbar to give us a bit more space.

This is what a document looks like that we created in R. It is very nicely formatted. You see that we have a title here and a subtitle. We see a table of contents, and we can even click on the entries to take us down to that section. We put a nice logo in there, and another subtitle section here with nice text written in, and a whole document that I created for you. This is what we're going to run through. I'm going to talk to you about the libraries, some simple arithmetic, how to create lists, what computer variables or objects are, and how addressing works. We're going to talk about distributions, descriptive statistics, visualizing our data, and tibbles, which are just data frames.
You can see those as very fancy spreadsheets, and we can import spreadsheets as tibbles. Then: how to import our data, how to inspect the data, how to select out of your data only the things that you want, descriptive statistics on this new data of ours, visualizing this new data, and a bit of inferential statistics. So I'm going to show you all of these. You can just look at this document here, but I'm also going to show you how to create this document. This video is really for you to decide: will R solve your problems? Would it be a good tool for you to use? So I'm not going to go into too much detail. I mean, this is going to be a long video, but I just want to give you a nice introduction so that you can decide whether you want to get into R.

You can come and have a look at this file here, but it is also going to be available on GitHub. And here we are on my GitHub site. That's juanklopper, without the H, and then forward slash, R underscore statistics. All the files that I've created so far live in this GitHub repository, and you can just clone or download it. If you don't know how cloning works in Git, you can just download the zip file, and it's going to give you all of these files. If we look down here, we can see the quick intro to biostatistics file, ending in .Rmd. That's the actual file. If you download it, you can open it in RStudio and see the actual file.

Now that you've installed R and RStudio, open RStudio. This is not what you're going to see, because this is the file that you've just seen on RPubs. What you'll see is a blank slate. What you can do is go to File, New File, R Script, or just under File you see the little plus sign there, and R Script. That is certainly one way to enter code into R. I can type in two plus two, and if I hold down Control or Command and hit Enter, you'll see the console open up at the bottom, and you see the two plus two and the solution, four.
You also see this little one inside of square brackets next to it, and we'll come to that a little bit later. So this is one way to work with R. You can also work right down here in the console. I can type two plus two down here, and instead of hitting Control and Enter or Command and Return, I'm just going to hit Enter, and we see the four appear right there. The difference between this console and this window up here is that the console is not going to save anything for me. I can do some quick calculations there, but that's all. This file up here I can save, and when I reopen it everything I typed will be back there.

So these are the two main parts, but let's explore a little bit more. On the right-hand side there are quite a few other things. We see Environment here, and if we start storing things in our file, those are going to appear here. Then History, a history of what we've done before. Connections we're not really going to be too concerned about in this introduction. The import dataset option we won't use, because we're going to do that right in the code.

Very importantly, there's the pane at the bottom right. Files is just going to show you the file structure of where you are at the moment. Plots: if you make any plots in your code, they're going to appear in this little section down here, and you can even export your plots as different types of images, JPEG, PNG, et cetera. Packages is very, very important. This is where we're going to go when we want to install packages. Remember, I said thousands of packages exist, and they extend base R to allow us to do a lot more. We're going to click on Install and then type in the name of the package we want to install. Leave everything else as it is and just click Install, and those packages will then be installed on your system and you can import them. And I'm going to show you how to do that every time you have a new file.
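
The Packages pane is one route; the same installation can also be typed at the console. The following is a minimal sketch, assuming an internet connection and using ggplot2 purely as an example package name. The helper `missing_packages` is hypothetical (it is not part of R), and the `interactive()` guard is an added precaution so that nothing is installed when the script runs unattended.

```r
# Hypothetical helper: report which of the requested packages are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("ggplot2")            # example package name; substitute your own
to_install <- missing_packages(wanted)

# Only install in an interactive session, and only what is missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)    # console equivalent of Packages > Install
}
```

Installing makes a package available on the system; it still has to be loaded in each new session before its functions can be used.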
You can just import all the packages that you have installed. Help is absolutely fantastic: a whole set of documentation, and I'm sure if you had to print all of it out it would be a very thick textbook indeed. So all the documentation is there for you. And Viewer is going to give us a view of certain things, and we might get to that right at the end, we'll see. You can also just click on the save button there and save this file.

That is not what I want to talk to you about in this introductory session, though. I want to talk to you about R Markdown. If I click on the little plus sign and go down, the third option there is R Markdown. It's going to ask the following. Do you want this to be a document? Do you want it to be a presentation, which is like PowerPoint? Shiny, which, remember, I said is for applications? Or there are even some templates that you can choose from. We're going to stick to Document. You can give your document a name, let's call it test, and you can put in your name there. Then you choose the default output format. Do you want that to be a web page, which you can either export to your own website or upload to RPubs? Do you want the export to be a PDF, or do you want it to be a Word document? So, lo and behold, you can write all your code and export it as a Word document or as HTML. You can change this after the fact, so just leave it at HTML, no problem there whatsoever. This is going to open up this default new file here, which you can then go and save as well. If I click on save, it's going to open the folder structure on my hard drive and I can decide where to save it.
The first bit of code that you'll see at the top is these three little minus signs, and it ends on line six with another three little minus signs. This part is called YAML, yet another markup language, and it just tells RStudio what to do when you eventually export to a Word document, a PDF document or an HTML document. It just gives us that bit of information, and I'm going to show you how to change it a bit later.

Now here you see the three tick marks, tick, tick, tick, then curly braces with r, a space, setup, a comma, and include equals FALSE. On the next line is knitr, K-N-I-T-R, two colons, opts_chunk, dollar, set, and in parentheses echo equals TRUE, that is, knitr::opts_chunk$set(echo = TRUE). Then the three little tick marks close it. That is called a code chunk. If we go back to the normal R script file that we made, everything you write there is code. But in R Markdown, you've got to put your code inside of a chunk like this. Everything outside of a chunk is going to be viewed as normal text, as you would have on a web page or in a Word document or PDF. So you've got to embed your code inside of a chunk, and I'll show you how easy it is to create a new chunk. When we go to the right-hand side here, you can click on this little wheel and it's going to give you a bunch of options for this chunk, and I'm going to show you all about that as well. What we see below here, though, is now just normal text.
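
To make the layout of such a file concrete, here is a minimal sketch that writes a stripped-down R Markdown file using only base R. It is not the exact template RStudio generates: the header is shortened to a title and an output line, and the temporary file location and the names `fence`, `rmd_lines` and `rmd_path` are assumptions made for illustration.

```r
# Build a minimal R Markdown file as plain text, using only base R.
fence <- strrep("`", 3)  # the three tick marks that open and close a chunk

rmd_lines <- c(
  "---",                                  # YAML header starts
  'title: "test"',
  "output: html_document",
  "---",                                  # YAML header ends
  "",
  paste0(fence, "{r setup, include=FALSE}"),
  "knitr::opts_chunk$set(echo = TRUE)",   # code lives inside the chunk
  fence,
  "",
  "## A second-level heading",            # everything outside a chunk is text
  "",
  "Ordinary text outside a chunk."
)

rmd_path <- file.path(tempdir(), "test.Rmd")  # temporary location (assumption)
writeLines(rmd_lines, rmd_path)

# To turn the file into HTML, uncomment the next line; this needs the
# rmarkdown package and pandoc, which ship with a standard RStudio setup.
# rmarkdown::render(rmd_path)
```

Opening the resulting file in RStudio shows the three parts described above: the YAML header between the minus signs, one code chunk, and plain text.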
This is outside of a chunk, and you see these two little hashtags, or pound signs, here. Because we are exporting this to HTML as it stands, they tell the web page how big this text should be, and if we export to PDF or Word, they likewise set how large this text is going to be. You can have from one of these hashtag or pound signs all the way up to six of them, six being the tiniest sub-sub-sub-subtitle and one being the largest, which is a title. You see title up there in the YAML; that is just going to put test in the biggest form it can possibly be. I can change it here, even though we put in test, to something else. That is by default going to be H1, the same as one hashtag. Two hashtags mean slightly smaller; that's H2, or header two, a second-size header. And then you just get the normal text. You can put in hyperlinks by putting them inside of these less-than and greater-than signs. You see more code chunks there, and I'm going to show you all about these.

So we can mix this text, we can mix different sizes of titles and subtitles, and we can create this research document. Just imagine: you produce a document of statistical analysis, but in between you can also write normal text as you would in any normal text editor, like Microsoft Word. You can even do spell checking by hitting F7 on your keyboard, or by going up to Edit and all the way down to Check Spelling, where you can see F7. It's a really phenomenal piece of software, RStudio. It is basically a web browser in the back end, in the parts that you don't see, but the part that we do see here, the graphical user interface, is fantastic to work with. So again, as I said, imagine you can put together this research document that you can share with others. You can export it as HTML and put that on your website, you can share it on RPubs, or you can just print it out as PDF or Word
documents and share those with your collaborators. It's got mixed code and the results of that code, so your statistical analysis, your plots, your graphs, and just normal words in between. So it's a beautiful research document.

Let's go to the document that we created. When a new document like this is opened up, I change a bit of it, and I'll show you now. I go from right here to the end, select all of that, and delete it. That's how I like to start. That was just a bit of template that RStudio opens for you to show you around, but this is where I like to start. I know that I'm going to have a nice big header there, which comes from the title in my YAML, and I'm always going to start with a subtitle. So that's two little hashtags or pound signs, and now I can start typing. The document that we're going to work on looks slightly different. I've added a few things, but this is where it really started: with these two hashtags and Introduction there, and then we start writing the introduction. I'm going to show you all the changes that I made, so that you're familiar with them and can incorporate them in your own documents.

So let me show you how I like to set up my documents. Here's our title; I just changed it to A quick introduction to R for biostatistics. The author is there. But then I change this output. If we look at the output before, it just said output, html_document, and it had the date there. I took the date out, and then I put output, colon, then on a new line html_document, colon, and then toc, colon, space, true. That is going to give us a table of contents. And number_sections I want as false. If you set that to true, it is just going to give you 1, 1.1, 1.2, then 2, 2.1 as your subtitles, sub-subtitles, etc.
go down, and that's not what I want for my document, but you can certainly set that to true. So this would be a default for me. Now, the first thing I did here, just for the sake of this introduction, is create a code chunk. The way that you can create a code chunk is to go up here to Code and choose Insert Chunk. But you can see the keyboard shortcut there: on a Windows or Linux machine that's going to be Control, Alt, I, and on macOS it's Command, Option, I. So let's do that. I come in here, hit Enter or Return to create a bit of space for ourselves, then hold down Control and Alt, or Command and Option, and hit I, and you see there we have a code chunk. So I can start writing my code inside of this chunk now. That is completely analogous to the way that we entered code in a normal script file. We can run that file and it executes for us line by line. Line number one there has two plus two; that's going to execute and give us a four, and then line two and line three, etc.,
but here in an R Markdown file we have to put it inside of this chunk. Now, if I go to the little gear icon here, you immediately see that I can give this chunk a name. Let's just call it name. As I type name in the tiny little block, you see name appear there. So the first thing after this r and a space is going to be the name of the chunk. It's very good to give your chunks names, because when you give this file to someone else, or you look at it weeks and months down the line, you can remember what that code chunk was all about.

You can also decide on the output. Whatever you do here is reflected by some code that is generated alongside, in the chunk header, and after a while you get to learn what this code is all about, and you can just type it in there and not come to this little gear icon. You have a few options here for output. You can say Show output only. That means that when we create our Word, PDF or HTML document at the end, the actual code is not going to appear, only the output of the code. You can imagine a situation where all you want to show is some text, headings and subheadings, and you want your graphs to appear but not the code that generated them; then you would choose that. Show code and output shows both the code that we write and the output. Then there is Show nothing, but run the code. Sometimes we have to execute some code to get some results which are used later, and we want to run it but don't want it to appear in the document; you can choose that. And then Show nothing and don't run the code. I don't really know why you would use that, but there probably are scenarios in which you would.

You can also choose not to show warnings and not to show any messages. Sometimes when we import packages, they're going to overwrite some of the base functions that are inside of R, or they're going to overwrite each other's functions if we import more than one package
and it'll give you some warnings and some messages. Those can be a bit annoying, especially if you know that they're going to appear and it's not a problem for you. You can just untick those and they won't be shown. Use paged tables we never use in my unit, and Use custom figure size you can certainly look at if you want to. But notice that when I unticked those, we got a comma there and then message equals FALSE and warning equals FALSE. As I said, after a while you can just start typing those yourself. If I hit Apply, we can start writing here.

What we see down here is exactly what we are busy creating. I called this one comments, and I said include equals FALSE, because I didn't want this included in the document that is actually going to be produced at the end. Now here's some actual code, but you can see that the beginning of this code has a hashtag or pound symbol. That has nothing to do with the heading and the size of the heading, because this is inside of a code chunk, where it means something completely different. It is actually a signal to R itself that this is just a comment line. Everything to the right of it on that line is just human comment, and R will totally ignore it. If we go back to a normal script, I can do the same there and say this is a line of comment, and that's it. It's going to be completely ignored when the execution goes line by line. Again, it's a very good way of leaving some information for your future self or for your collaborators about what this was all about. So do make liberal use of naming your code chunks and putting these lines of comment in.

Now, why name these code chunks? Another good reason is that if you go right to the bottom here and click on that, an outline of your whole document opens up, and because I've named every chunk, you can see there that chunk 52
has mean age by group. So it's very easy for us to go and find something later on that we were working on. Instead of scrolling down this long file, we can just quickly go and look for it. And you see here in darker text, it's tiny here on the screen, I hope you can see it, that all the headings and subheadings also appear here, nicely indented, so that you can see how this document was put together. If I remember that I quickly wanted to go to addressing again, I can just click on it and it'll jump to that section. But this only really works if you name your chunks. It will automatically pick up these symbols here for your headings, but name your chunks, otherwise it's just going to say chunk six, chunk seven, chunk eight, and you're not going to know what each one is about. So just get into that habit; I prefer to do that.

Now, in this last little bit on setting up the document, two more things. If we are going to produce an HTML file, we can actually bring in some cascading style sheets, and I can write normal HTML just by putting things inside of these less-than and greater-than symbols. So I'm going to put a bit of style in. There's my opening style HTML tag and there's my closing style tag, and anything in between is what we actually want. The type is text, forward slash, css, for cascading style sheets; that's just standard, we just put that in. Then all I set is heading one, heading two and heading three, that is h1, h2 and h3. Those are akin to the single hashtag, the two hashtags and the three hashtags in Markdown: the very big title text, then the subtitle and the sub-subtitle, each getting smaller. I've just given each of them a colour. Colour is the property there, and it goes inside of the set of curly braces, then a colon, and then the value in hexadecimal code starting with a hashtag. So that's a navy blue, that's a gold, and that's a slightly lighter sort of navy blue. That gives us colour, and you can bring in all
sorts of other things. If you know HTML and cascading style sheets, you can really go to town on your web pages and make them look all fancy.

The last thing for this section, before we start with the actual introduction, is that you can also bring in logos. One thing I want to say about that: we are going to save this file somewhere on our hard drive or solid state drive, and what I like to do is store all the files that I work with, together with this notebook, this .Rmd file, this Markdown file, in the same folder. Everything lives in that same folder, and that means I don't have to type in the long address, C, colon, backslash, My Documents, backslash, et cetera, and on macOS it's going to be different. I keep it all in the same folder and then I do the following.

Remember this chunk, r setup with include equals FALSE, that was automatically inserted, with knitr::opts_chunk$set(echo = TRUE). That was there automatically. What I like to add there is this little line of code: setwd(getwd()). Once I've saved this file on my hard drive, I can let R find out where on the hard drive the file is with this text here, getwd, open and close parentheses: get working directory. That is actual R code. It tells R, go and find out where I am at the moment. So you have to save the file first, and then you can run this getwd, and it's going to return this long string of C, colon, backslash, or something a bit different on macOS. So it knows where it is, and I use that inside of another function. You don't know a lot about functions at the moment, so just roll with it. I've got setwd with its own set of open and close parentheses, and this getwd, with its open and close parentheses, lives inside of setwd's parentheses. These things are called functions, and what goes inside of the parentheses are called arguments. So this getwd without an argument, this is open
close parentheses, that is an argument inside of the setwd function. So I'm passing this argument to setwd, and that just means that the long address that getwd got for me is passed as an argument to setwd. That tells this file that we are using this little part of our hard drive, where this notebook, or I should say this R Markdown file, is saved, as the default. So anything that lives inside of that same folder I can reference directly, because this file knows where it lives now.

Now I've got this nice little PNG here. We see it there: it's called krg elegant logo for lightbg.png. It's a PNG image file, and I don't have to put in the long address to get to it, C, colon, backslash, images, whatever, on your C drive or whatever drive it is. Because we've used setwd, this file knows where it is in the world, and because this file and this PNG live in the same folder, I can just type the name of the file. If I had a spreadsheet file, I would also save it in that specific folder. That is the way I like to work; it just makes life so much simpler. The way that you get a logo into your Word document or HTML document in the end is with this exclamation mark, open and close square brackets, and then, in a set of parentheses, the name of your image file, .jpeg or .png or whatever it is. Now when we export this, that actual logo that you saw before will be there as well.

So that's it for the way that I like to set up my files in my research unit. It's a structure that we use, and this will be the default. You can actually just save this as a template and open the template every time. Next up, we're going to start looking at libraries. All I'm going to do with the console here at the bottom is click this little minimize button. You see minimize and maximize: I hit minimize and it goes away, and I can hit maximize again and it'll come up. So let's just talk about
libraries. I mentioned before that these are the packages that allow for the extension of base R so that you can do so much more. The ones that we're going to use here are these four: tibble, readr, dplyr and DT. They're not going to be there by default on your system, so please go to Packages and then Install, and type them in one by one, tibble, readr, dplyr and DT. Installing them one by one is the safest, and then they'll be available. They are on your system now, but every time you run a file you've got to tell R to go and use them, and the function for that is library. Every function is followed by a set of parentheses, and inside of those parentheses we put one or two or three or more arguments, separated by commas. Each of these will have just one argument, that's tibble, readr, dplyr and DT, and you type their names as they appear there.

Now, to use these we've got to actually run the code chunks. If I run this first one here, nothing is going to happen whatsoever. I just want to delete this one so I don't save it, and let's neaten things up with a single open line there. You can just click on this little run button and that will run the code. You see a little green line appear there. Let's run this one: there's nothing in there but comments, so nothing is going to happen. This one actually has something in it, so let's run this code. Now the knitr options have been set and the set working directory, get working directory line has been run. So I can just click, as I said, on this little run button. Alternatively, if we're inside of the chunk, I can hold down Shift, Control, Enter or Shift, Command, Return, and that is going to execute it as well. That's just the keyboard shortcut. And all of those libraries are now part of R, and I can use all the functions that are inside of those libraries.
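
The two setup steps just described, the nested working-directory call and the loading of the four libraries, can be sketched as follows. In the document itself each package is loaded with its own call, such as library(tibble). The loop and the `requireNamespace` guard below are additions for this sketch, so that it still runs on a machine where one of the packages has not been installed yet.

```r
# getwd() returns the current working directory as a text string; that
# return value is passed as the single argument to setwd().
# When an R Markdown file is knitted, the working directory is normally
# the folder that contains the .Rmd file.
setwd(getwd())

# Load each package for this session.
# Equivalent to writing library(tibble), library(readr), and so on.
for (pkg in c("tibble", "readr", "dplyr", "DT")) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)   # character.only: pkg holds the name as text
  } else {
    message("Package not installed: ", pkg)
  }
}
```

Any messages printed while loading, for example about one package masking another's functions, are the kind that the message and warning chunk options can hide.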
So let's start with simple arithmetic. I've got a bit of text there, Simple arithmetic, and I've written a few lines of text. What I would actually love for you to do is to go to GitHub, download this file for yourself and start making your own notes. Go in between these sections, make a little space for yourself and type in your own code and your own notes throughout. I've kept the notes very sparse in this document because I'm talking about all of these things, but I would love for you to download it and type in your own notes in between.

Addition we've seen before. Very simple: 2 plus 2 plus 4. I'm going to hold down Shift, Control, Enter or Shift, Command, Return, and we see the execution there. It's 8, and again there's this mysterious little 1, which we're going to talk about. Subtraction is just like on a normal calculator, very easy. Multiplication: remember, we don't have a multiplication sign on a keyboard, so it's Shift and 8, that little star symbol on most keyboards. Let's execute that, and we see the answer is 24. Let's do a bit of division: 3 divided by 4 is 0.75. For powers we use the little caret symbol, Shift and 6 on my keyboard: 3 to the power 3 is 27.

We can also change the order of arithmetical operations. Remember, we do division and multiplication before we do addition and subtraction. So if we want the addition to happen before the multiplication, we actually have to put it in parentheses. This says do the 2 plus 4 first and then multiply by 3, so that's 6 multiplied by 3, which is 18. If I didn't put the parentheses in, it would be 4 times 3 first, which is 12, plus 2, which is 14. But we're going to get the 18 here because we forced the order.

Now some other functions, something that we might use a lot in statistics: the exponential. Euler's number e to the power 1 gives me e. We write the exponential function in R as exp, and if we raise e to the power 1, we're actually going to get Euler's number,
2.718282. So we see there one, two, three, four, five, six decimal places.

Log base 10 is the function log10, and I pass an argument to it. In this instance it's 1000, so I'm asking what the log of 1000 is, and the log is base 10 here. 10 to the power what gives me 1000? Well, 10 to the power 3, so log base 10 of 1000 is 3. Now the normal log function, without the 10 there, will by default give me Euler's number e as the base, so this is actually the natural logarithm. But I can actually specify the base, so I could say base equals 10 as my second argument. So let me do this properly and show you. If I start typing log and then 1000, the base is going to be e. It's the natural log. But as I said, I can also say base equals 10.

So what we have here are two arguments for the log function. 1000 is the first argument: it expects the number that it must calculate the log of. Then, after a comma, comes a second argument. This argument is called a keyword argument, because we're actually specifying a name for it. We didn't give the first argument a name. I didn't say number equals 1000. That would be wrong. The way that this code was written, the first argument that it expects is the number that I want to calculate the log of, so I can just write it. But thereafter there are some keyword arguments, and keyword arguments you can put in a different order. You can put one in front of the other and it doesn't matter, because they have names. The first one that it expects, though, and that's the way the code was written and you can't do anything about that, is the actual number, and so that's not a keyword. Sometimes there are two or three or four of these which are just expected in that order, and you can't swap that order around. But once you get to the keyword arguments you can swap their order around, because they've got names and R won't get confused about what you're trying to do. So one of the keyword arguments for the log function is the base, and we say base equals, and here I said exponent
one, so that is just e, which is the default anyway. But what I like to do after a comma is hit Enter or Return, because most of the time I love to put my arguments one under the other, each on its own separate line. You can see the nice little indent that RStudio does for you all by itself, and that's fantastic. And we see the natural log of a thousand is 6.91.

Next up, we're going to talk about lists. We've put in single numbers and we've done some arithmetic, but what if we want to save a bunch of numbers or something else? We call those lists. Inside of R we sometimes refer to these as vectors, but let's just call them lists. We put them inside of a function called c, c for concatenate, basically. So I'm going to say c and then I'm going to pass the list of arguments, and the arguments are just going to be values. So let's imagine these are systolic blood pressures of patients. One patient had 120, the next 120, the next 110, the next 130, the next 140. So those were the one, two, three, four, five systolic blood pressures, and I just type them all in as arguments inside of this c function. Let's run that and we get back the 120, 120, 110, 130, 140. As simple as that.

Now we don't need to stick only numbers in there. Here you can see that I stick actual words in there, or I can even put in sentences. These are called strings, and strings go inside of quotation marks. Always, always, always put words inside of your code in quotation marks. They're called strings. So here I have pneumonia, ARDS and chronic bronchitis, and I can run that, and lo and behold, I get back my pneumonia, ARDS, chronic bronchitis, still with this mysterious little 1 at the end, in the beginning I should say, and we're going to get to that.

Now you can think to yourself, well, it is a bit unfortunate. Do I have to type in these numbers over and over and over again if I want to reuse them? Well, fortunately, no. What we can do is create a
little space in our computer's memory, and we can store this in that space. Now this little space in memory needs a couple of things. That's the way computers work. First of all, it needs to have a name. I have to give that little space in memory a name, and that's called a computer variable name. So what we've done here is given it a name, SBP. And please give it a descriptive name. Again, you give your code to someone else, or you look at your code months or years down the line, so you want to give it a name that means something to you, so that you can just look at it and say, oh yeah, that's what I was referring to. So SBP for systolic blood pressure, that works for me. So I'm going to call this SBP. That's the little name, the computer variable, for that little piece of memory.

Then we're going to store something inside of that piece of memory, and that thing is called an object, and that object has a type. So if we see a c with a 120, 120 and so on, that is a list. So we are putting a list object inside of your computer's memory, and we give that little bit of memory a name, a computer variable name, SBP. In basic terms, this is what is happening.

Now you'll see this weird little thing, less than and minus. You are very welcome to use equals instead. Now equals in a computer language means something very different from equals in mathematics. A single equals symbol means assign. It says take whatever's to the right of me and assign it to whatever's to the left of me. So here we have a list object of numbers, and we are saying assign this to whatever's on the left. And on the left, R recognizes the characters that you typed and knows that this should be a computer variable name. So this equals sign is not an equals sign, it's an assignment operator. So we're going to store that inside of SBP.

Now how do you choose your names? As I said, please choose them so that they mean something to you. Don't start them with illegal
characters such as spaces or numbers. Never put a space inside, because it's going to be seen as two separate things that you're trying to enter. Just stick to the basics. SBP works for me. Now this list object is stored in my computer's memory, and I can recall it just by typing its name. So I'm typing the computer variable name in this code chunk, and I can execute that, and it's now going to give me back those numbers: 120, 120, 110, 130, 140. So I don't have to retype them ever again. They are now stored in this piece of memory in my computer, under this computer variable name.

So let's create one instead of typing it in. Something very useful that I use every now and again is this sequence function, seq. That just reminds me, I forgot about this little equals sign that I said you could use here. So you can use the equals sign, but it's more common in R, for a variety of reasons, to use this little assignment symbol, and the way to easily get that is to hold down Alt or Option and hit minus. That's the keyboard shortcut, Alt minus. And it sort of shows what it's doing. The arrow is pointing to the left here, it's a little stabby arrow, and it is clearly trying to show: take whatever's on the right and pass it to what's on the left. It gives a better visual idea of what this assignment operator does, so I like using it instead of the equals sign. For arguments, though, keyword arguments with the name of the argument and then its value, there we put the equals sign.

So we have here seq, which stands for the sequence function, and it's taking three keyword arguments here: from, to and by. From says start at 1, to says where to end, at 100, and by says how big the jumps are. So 1, jump two is 3, jump two is 5, jump two is 7, and that's what the sequence does for me. I don't have to put the by in. If I didn't, the default is 1, so it'll always go up in ones: 1, 2, 3, 4, 5, until 100. But I want to go up in steps of two, so I put the by keyword argument in there.
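Here is a little sketch of the storing and the sequence we've just talked about. The blood pressures and the from, to and by values are the ones from the chunks; the lowercase spelling sbp and the name diagnosis are my own choices for this sketch, so the file on GitHub may spell them differently.

```r
# Store a list (vector) of systolic blood pressures under a descriptive name
sbp <- c(120, 120, 110, 130, 140)
sbp                        # typing the name recalls the stored values

# Words are strings and always go inside quotation marks
diagnosis <- c("Pneumonia", "ARDS", "Chronic bronchitis")
diagnosis

# Start at 1, end at 100 at most, and jump in steps of 2
patient.number <- seq(from = 1,
                      to = 100,
                      by = 2)
patient.number
length(patient.number)     # 50 values: 1, 3, 5, ..., 99
```

Note that the arrow does the assigning, while the equals signs are kept for the keyword arguments inside seq.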
That by keyword argument is in there, and again I'm assigning it to this computer variable. Now you see a different naming convention here. I can actually string words together. Remember I said no spaces, but I can use dots. This is a bit like snake case, because you can snake the words together. Now proper snake case actually uses underscores, so that it really is snake case. So I could have called this one patient_number and just put these underscores in between, no spaces, but I like the dots when I work with R. It just sets my mind that I'm working with R. If I work in Python I use underscores, if I work in Julia I would use underscores, and if I work in the Wolfram Language I will just not use any spaces or dots or underscores. Every subsequent word will just start with an uppercase letter. So this is what I would normally do inside of the Wolfram Language: patient, the first word is always lowercase, and then every subsequent word starts in uppercase so that I can read it as a human being. Most of these work in most of these languages, but I just like to use these different styles because it reminds me what language I'm working in at the moment. So for R, I like these dots in between.

Then I'm just going to call that computer variable and we're going to see what it looks like. There we go: 1, 3, 5, 7, as promised. Now we can see something else. Remember the mysterious 1? Now we see a 25 and a 49, and that is a beautiful segue into the next section, and that is addressing. Because I have a list object here, every element inside of that list has an address, just like you live at a certain address, and it starts at 1. R starts counting at 1, whereas Python starts counting at 0. So just remember, R starts counting at 1. So this 1 is element number one, the 3 is element number two, this 5 is element number three, and so on: four, five, six, seven. And what R does when it shows you the result, it just says the first one
on this line. Now you might have a much bigger monitor or a much smaller one, as far as the resolution is concerned, so you might see more or fewer of the elements on a line. R just says, for the first one on each line, what element number it is. So this 49 is element 25 and this 97 is element number 49. So it's just a small indication: please remember that all of these have an address. Addresses go in square brackets, and R is just giving you the addresses of the first element in each of these lines. That's just the default setup.

So let's talk about addressing. Remember I said that it always goes inside of square brackets. So if I were to call my SBP and put the number 1 inside of square brackets behind it, it says give me back element number one. The first element that we had in there was the 120, and we see the 1 next to it: that was 120. But what if I wanted the first three? Well, I just use this colon symbol, 1 colon 3. It says one, two and three, a little range operator, and that gives me the first three back. If I only wanted number one and number three, I'd have to pass that as a list. And remember how to create a list? Yes, they are arguments to the c function. So c, 1 comma 3, inside of this set of square brackets, which means addressing, is going to give me back just the 120 and the 110. That was element number one and element number three. So in short, that is addressing. It gets a lot more complex, but let's move on.

So let's move on to distributions, and what I really want to do here is to show you how to generate your own data. Now when I learn a new computer language, I just want to play around, and I don't necessarily have or want an actual data set that I might have lying around. When we have data sets on patients, we keep those very secure. We are diligent about that, and we don't
use them to play around with. So we generate simulated data, and that is what we use as a teaching tool. So I want to show you how to generate your own data. When you learn a new language, when you just want to practice, just generate your own data. The beauty of generating your own simulated data is that you have absolute control over it, so that you know what to expect when you do the analysis on it. It also helps when languages upgrade. So in some of the other languages that we use, for instance Julia, in which you can take a course on Coursera and get a certificate from the University of Cape Town: Julia changes, it's a new language and it's just gone over to version 1.0. So many things change, and when they do change I just want to test things out, so I would generate a new simulated data set that I can play with.

So let's generate some data, and we're going to do that using different distributions. First of all, let's just look at the uniform distribution. Remember, that is when I have a numerical or a categorical variable and every value in the sample space of that variable has an equal likelihood, an equal probability, of being chosen. So the first thing I want to show you inside of this code chunk is this set.seed function, and I've passed it an integer argument. You can just use 1 or 12 or 145 or 123456, it doesn't matter. If you use the same number every time, it means that if you rerun this code the same random numbers will be generated for you. So if you were to run this code and you used 123 like I have here, you're going to get the same pseudo-random numbers.

So I'm going to create this computer variable called age, and I'm going to assign to it this list object, and to create this list object I'm going to use the sample function. The first argument is not a keyword argument. It just says give me the range of values that I can select from, the sample space
for this variable, and it's 18 colon 85. That's a range, so it says from 18 to 85. I didn't put a step in. Remember, with the sequence we had from, to and by as keyword arguments, and there the by was 2. By equals 1 is the default, so I can also write it this way, 18 colon 85. So that's 18, 19, 20, up to 85. That's the sample space. Comma, the next argument that it expects is the number of values that you want, and I want 500 values, please. And now I say replace equals TRUE. Inside of R, TRUE and FALSE are always all capitals. You can also be a bit lazy and just type uppercase T for TRUE and uppercase F for FALSE, but I like to type out TRUE and FALSE. Replace equals TRUE says that if you choose a number, say 27, it then throws that 27 back into the bowl, so that when you choose at random the next time, that 27 is available again. If you didn't say replace is TRUE, that 27 would not be available for a second choice. You might want that, and in that case you just say replace equals FALSE.

So let's run that. We now have an age variable, and it's got 500 values in it. Now look at my environment here on the right-hand side. We haven't spoken about that. Remember when we created SBP and patient.number? They are all there. They live here, so that we can take a quick peek at them. And there we see age, which we've just created. It says there that the addresses run from 1 to 500, so there are 500 values, and look at this, it says int. It means all of the values in there are integers. So R looked at my list object and at the specific elements, and it saw, well, there were all integers in there. Great stuff.

Let's create the next one, and I'm going to call it before.after. So just imagine you had a study and you measured patients' cholesterol before and after giving them a new drug, and you just want to know what the difference was in their cholesterol before and after this intervention. So we're going to simulate that.
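Before we carry on, here is the age chunk from a moment ago as a sketch, run twice to show what the seed does. The second variable, age.again, is my own addition just for the comparison. Keep in mind that asking for 500 values from a sample space of only 68 without replacement would make R stop with an error, which is why replace is TRUE here.

```r
# Same seed, same pseudo-random draws
set.seed(123)
age <- sample(18:85,            # sample space: 18, 19, ..., 85
              500,              # number of values to draw
              replace = TRUE)   # throw each drawn value back into the bowl

set.seed(123)                   # reset to the same seed
age.again <- sample(18:85,
                    500,
                    replace = TRUE)

identical(age, age.again)       # TRUE: the seed makes the draws repeatable
length(age)                     # 500
class(age)                      # "integer", shown as int in the environment
```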
According to the normal distribution, and more than that, the standard normal distribution. For that I'm going to use the rnorm function, and I'm going to pass a single argument, 500, meaning I want 500 values. If I don't put in any other arguments, it's going to use the standard normal distribution, a mean of 0 and a standard deviation of 1, and it's going to give me back 500 values at random from that distribution.

Now what if I don't want the standard normal distribution? I want a different mean and a different standard deviation. So here I'm going to create an SBP. Now we've used SBP before, and if I reuse SBP I'm going to overwrite that little piece of memory with new values. So it's overwritten. Again I'm using set.seed with 123, just so that when you run the code you're not going to get a surprise. You're going to get the same values as me. Now look at something I've done here. You see, if I hover over the opening parenthesis of rnorm, it highlights the closing parenthesis, so there it is. Let's just look at that rnorm. Now I'm going to use three arguments. The first one is how many, and that's 500, but then there are two keyword arguments, mean equals 120 and sd equals 20. Now sd is standard deviation. So I say give me a normal distribution with a mean of 120 and a standard deviation of 20, and give me 500 values back from that distribution.

But that's going to give me values with six decimal places, and I don't want that. I want no decimal places. So one way to go about that is to pass this complete rnorm call as the first argument inside of the round function. So the round function takes the values as its first argument, and then it takes a keyword argument, digits equals, and that means decimal places. In this instance I want no decimal places. So it's going to give me these 500 blood pressure values, and when we measure blood pressure it's in integer values, there are no decimals, so when I simulate that I like to keep it
real, and so I put that inside of the round function.

Here we create one called CRP, for C-reactive protein. Again I'm putting that inside of the round function, and this time I want one decimal place. The first argument is the actual 500 values that I want, and this time I want them from a chi-squared distribution, so rchisq, 500 values again, and then a keyword argument, df equals 2, so two degrees of freedom. So let's generate that, and there we go.

Now I want to do something else. I'm going to have as my sample space some nominal categorical values. Very easy. I'm going to set my seed. I'm calling this one group, and I use sample, but instead of a range, say 18 to 85, I actually pass a list of strings. My two strings go inside of the c function, and I'm going to call them control and placebo. I want 500 of those at random, and replace is TRUE. I have to put replace equals TRUE, otherwise there are only two to draw. Once control is taken and I don't throw it back into the bowl to be selected again at random, it's not going to be there and only placebo is left. So I could only ever get the two, the other 498 are not possible, and R will stop with an error. So I say replace equals TRUE. And if we look up here once I've executed that, we see group appear there, and it's control, placebo, and it's going to carry on 500 times. Both control and placebo have an equal likelihood of being chosen at every instance.

Now instead of each of them having a 50-50 chance of being chosen, I can actually set the weights. This one I'm going to call side effects, so side.effects. Sample again. My sample space is a list of no and yes, I want 500 values, replace is TRUE, but this time I set a probability of being selected. It's in the same order as the sample space, so I've got to have two values because my sample space has two values. So no will have an 80% chance of being chosen and yes will only have a 20% chance of being chosen at every one of these 500 drawings of samples.
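Putting the last few chunks together, this is roughly what they look like. It is only a sketch: the lowercase names sbp and crp and the exact capitalisation of the strings are my guesses, and I reset the seed before every variable, which may not be exactly how the file on GitHub does it.

```r
set.seed(123)
before.after <- rnorm(500)              # standard normal: mean 0, sd 1

set.seed(123)
sbp <- round(rnorm(500,
                   mean = 120,
                   sd = 20),
             digits = 0)                # no decimal places, like real blood pressures

set.seed(123)
crp <- round(rchisq(500,
                    df = 2),            # chi-squared with 2 degrees of freedom
             digits = 1)                # one decimal place

set.seed(123)
group <- sample(c("Control", "Placebo"),
                500,
                replace = TRUE)         # equal chance for each value

set.seed(123)
side.effects <- sample(c("No", "Yes"),
                       500,
                       replace = TRUE,
                       prob = c(0.8, 0.2))  # weights in the same order as the values

table(side.effects)                     # counts of No and Yes
```

The table function at the end is not something we've covered yet. It simply counts how often each value occurs, so you can check that the weights did what you asked.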
So, very important there. So we're going to have a very skewed distribution here: there are going to be many more no's than yeses. So we've simulated our own data. Let me show you on the simulated data how easy it is to do descriptive statistics, and that's up next.

So there we go, the descriptive statistics. Now there are a lot of functions built into base R. They've got nothing to do with these extra packages, and here they are. Mean is very easy. The function is just mean, and I'm going to pass my list object, which was called age and contained the 500 values. So let's just run that. Remember, that came from a uniform distribution, so every value was equally likely to be chosen, and we see a mean of 51.184. Median: well, the function is called median, and the median was 50. The variance is var, and standard deviation you've seen before, that's sd, and you just pass the name to it. Remember, the variance is the square of the standard deviation. Range: when we do range, it's going to give us back the minimum and the maximum value, and in those 500 it did indeed choose 18 and it did indeed choose 85, which were the limits of our range anyway. It doesn't mean they would have to be there. This is at random, but using that set.seed of 123, if you run this you are going to get the 18 and 85. Interquartile range, very easy: IQR, all uppercase, and that's going to give me the difference between the third quartile and the first quartile, and that is 33. Speaking of those quartiles, I can get those values back as well, and the function I like to use is quantile. I pass my list as the first argument, comma, and my second argument for the quantile function is which quantile I actually want. So the first quartile is equal to the 25th percentile, so it's 0.25, and I get back that value, the 25th percentile value. And then the third
quartile is the 75th percentile, so I use 0.75 as the second argument and I get it there. If you subtract those two you get the interquartile range, which is 33, and indeed 33 plus 34 is 67, so that works. Now there's a summary function, and that's going to give me all of these in one go. Let me show you: summary, age, and look at that. I get the minimum, the first quartile, the median, the mean, the third quartile and the maximum, all in one go. That's fantastic. If I write a report or we write a journal article for submission, beautiful, we can do the summary statistics in one go. I love the summary function.

Now when we talk about categorical variables, imagine a large data set that someone collected for us, thousands of rows long. We don't know what all the unique values are, what the sample space of that single categorical variable might be. Now we simulated this one, so we know exactly what we put inside of group. But if you didn't know and you just want to see what that sample space is, just use the unique function, and that shows us there are only two elements in the sample space of this group variable, control and placebo. But that's the way we designed it, so no problem there.

So summary statistics on the simulated data, very easy. And I always say when you are doing healthcare research, medical research, biostatistical research, any kind of research, do descriptive statistics first. You get all this data in a flat file, usually a spreadsheet file, even if it's been extracted from a database. When you look at that large set of columns and rows of data, you don't know what the data is trying to say, what the message is, what knowledge is locked in there. You have to tease it out, and as human beings the best way to do that is first to summarise it, and we summarise it through descriptive statistics. And then secondly I love to visualise it, because that visualisation is going to give me a great idea of what the statistical analysis is going to show.
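Here are all of those descriptive functions in one place, as a sketch. I recreate age and group at the top so that the chunk runs on its own. I've left the results out on purpose: the exact values can depend on your version of R, so don't worry if yours differ from the ones I read out.

```r
set.seed(123)
age <- sample(18:85, 500, replace = TRUE)
group <- sample(c("Control", "Placebo"), 500, replace = TRUE)

mean(age)               # arithmetic mean
median(age)             # middle value
var(age)                # variance
sd(age)                 # standard deviation, the square root of the variance
range(age)              # minimum and maximum
IQR(age)                # third quartile minus first quartile
quantile(age, 0.25)     # first quartile (25th percentile)
quantile(age, 0.75)     # third quartile (75th percentile)
summary(age)            # minimum, quartiles, median, mean and maximum in one go

unique(group)           # the sample space of a categorical variable
```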
So, describing it first and then visualising it. So let's get to visualising the data.

So here we go. As I said, I left lots of space here. Please make your own notes in between and make this document your own. The first one I want to show you is just a normal box and whisker plot. So once again, R has got lovely built-in visualisation, but you can also use ggplot, and I love to use plotly. There are some videos out already on using plotly in R. These are packages that allow you to create even better looking plots, but right here inside of R, plots are fantastic by default. So boxplot is the function for a box and whisker plot, and let's just look at age. I'm going to run that, and there we go, a beautiful plot. We see our median there, the edges of the box here are the first and third quartiles, and then the minimum and maximum as far as the whiskers are concerned. There seem to be no statistical outliers, and you can see the values here on the left-hand side. They are written on their sides, and I'll show you how to fix that. I don't like them on their sides, but let's just roll with it for the moment.

Now say, for instance, this is going to go on our website or into a presentation. Let's bring some life to this plot. It looks nice to me, I love it, but let's add something to it with some keyword arguments. So I'm going to say boxplot, and then age is my list of values, and then, comma, my first keyword argument is going to be col. That stands for color, and the colors all have specific names, and there are a lot of them. The first one we're going to use is deepskyblue, always inside of quotation marks. These are strings. Main is my second keyword argument, my third argument in this instance, and that means the title. So I'm going to call it patient age. Remember, that's a set of words, a string, so we put it inside of quotation marks. xlab means the x-axis label, and I want that to say patients, and ylab is the y-axis label, and I want
that to say age. So let's run that and see what it looks like. Very nice indeed. I see this deep sky blue color, I see my title there, and I see my x-axis and y-axis titles, the labels I should say. Beautiful. That's quite pretty.

Now just a word on this. You see this is happening inside of the document itself, not down here inside of the plots. We could have done all that code over here, and instead of putting each set of code inside of a chunk, we could just write the code. But if we wanted to leave ourselves titles, you can't do that here. If you want to leave yourself little comments, you have to put them in a comment line like we do here. You can't just write normal words in here. So I'd rather work in here, put my code inside of chunks and then write the stuff that I want in between. If you did it over there, the plot would appear here on the bottom right-hand side, but in the Rmd file, the R Markdown file, it is going to appear right in line.

So let's look at histograms. hist is our function of choice, and it's going to do a histogram. Let's take the before.after. Now remember, we chose those 500 values at random from the standard normal distribution, so we had better see a Gaussian sort of curve here. Let's make the color pink, main is "Difference in measurement before and after treatment", and then my x-axis label and my y-axis label. Very easy to do. Let's hold thumbs, and look at that. Beautiful standard normal distribution, isn't it?
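As a sketch, the two plotting chunks look something like this. The box plot arguments are the ones I read out; for the histogram I didn't read out the axis labels, so the xlab and ylab strings below are placeholders of my own. The data are recreated at the top so that the chunk runs on its own.

```r
set.seed(123)
age <- sample(18:85, 500, replace = TRUE)
before.after <- rnorm(500)

# Box and whisker plot with a color, a title and axis labels
boxplot(age,
        col = "deepskyblue",
        main = "Patient age",
        xlab = "Patients",
        ylab = "Age")

# Histogram of the simulated differences
hist(before.after,
     col = "pink",
     main = "Difference in measurement before and after treatment",
     xlab = "Difference",     # placeholder label
     ylab = "Count")          # placeholder label
```

If you want to see all the color names that R knows about, the colors function lists them.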
Fantastic. This histogram, I just love it, and we see our labels there, our title and the color.

What about a scatter plot? Remember, with a scatter plot we're going to have an independent variable on the x-axis and a dependent variable on the y-axis, so each patient will have this pair of values and we want to have a look at them. The function for that is just plot. We're going to plot the age against the systolic blood pressure. The first one is the x-axis value, the independent variable, and the second one is the y-axis value, the dependent variable. So remember, you have to have an equal length of those. We had 500 and 500, so for those pairs you have to have a value for each side of the pair. We're going to color this just in blue, we're going to have a title and we're going to have some labels. And there we go, we see a fantastic, nice scatter plot, and we can really see there's no dependency between age and systolic blood pressure. So I know that when I do linear regression or correlation, that's going to be very poor. Age is not a predictor of systolic blood pressure in the simulated data that we created here.

So we've created all our simulated data. I don't like to keep it in this format. I like to put it inside of what is called a data frame, and especially when I have a spreadsheet file with actual, real patient data that we bring into R, we're going to import that as a data frame. So whether it's my simulated data, which I'll convert to a data frame, or my spreadsheet file with actual patient data, that's going to be imported as a data frame as well. Now the data frame is built into R, but I don't like to use a data frame. I actually like to use one of the newer packages, a newer idea, that gives us a tibble. It's a weird word, but it just means a fancier data frame. So let's have a look at tibbles, and let's create a tibble from the simulated data that we've got. So I'm going to call my tibble my.data. That's my computer
variable name, and I'm going to assign to it this tibble. Remember, that's a package, a library, that we imported. So tibble is the function, and you have got to think of this as a spreadsheet file. I'm just trusting that you've seen a spreadsheet file before in your life. The first row is going to have all the column names in it, and those column names usually refer to the variables. So there's going to be age or patient number, systolic blood pressure, et cetera. Those are the column headers, and then down that column we'll have, for instance, the blood pressures of all the patients. But if I look along a single row, that'll be the blood pressure, age, whatever, for that single patient. So have this spreadsheet idea in your head.

So what we're going to do here is give the column header, the variable, a name. So I'm going to write age with an uppercase letter, and I'm going to assign to that my age list object with its 500 values. Then I'm going to have difference as my second column header, my second variable, and I'm going to assign to that the before.after list object with its 500 values. To uppercase CRP I'm going to assign CRP. And you see why I created names that meant something to me, and now I'm just using them in a different way that still means something to me. So we see group equals group there, then SBP, where I put the S small and the BP large, whatever you want to do, and side effects I wrote all in uppercase. But I'm assigning those list objects, with the 500 values in each, to a name that's going to be in the first row, the column header row, of my spreadsheet. But remember, it's not a spreadsheet, this is a tibble.

Let's run that, and now we see my.data appear on the right-hand side in our environment, and to the right of it we see a little spreadsheet icon. If I click on that, a new tab is going to open up. It's very tiny, and you won't see it well if you're not viewing this at the maximum 1080p, but there you see it looks like a spreadsheet. There's my age, the difference, CRP, group, SBP. Remember,
that's why I set the names. And there are all the 500 values under age, all the 500 values under difference, all the 500 values under CRP, but every row is one patient. So, for instance, this patient was 37, their cholesterol came down because the after was less than the before, there's the CRP value, they fell in the control group, they had that blood pressure when they got into the study, and they didn't develop any side effects. So let's close that table, but you can always get back to it just by clicking on that little icon there. If you look down here, you could see the console opened up, and what actually ran was the View function, with an uppercase V, and then my.data. You can also come and type it in here. Let's do that, and that's going to open the same thing for us here at the top, under a tab. So that's the same as that little icon there. Let's just minimize our console here so we can carry on. So that is our tibble.

Now I might want to share this as a spreadsheet file with someone else, so I want to export it as a spreadsheet file, and alongside the tibble there's this function called write_csv (strictly speaking, it comes from the readr package). Now please do the following for me. As a proper researcher, never save your spreadsheet files as Excel files or any kind of proprietary file format such as xlsx. Export them as CSV files, comma-separated values files. That is a much better way for us to share files with each other. So I'm going to use this CSV file. Now when you open a CSV file inside of Microsoft Excel, for instance, it's going to look like a normal spreadsheet. One thing is that it won't have all the fancy formatting, and it won't have different tabs at the bottom so that you can have different sheets inside of the same file. There's just one sheet per file. But anyway, that is the proper way to do it. So the first argument is going to be my.data, which is the actual tibble that we've just created, and then I'm going to give it a name, data.csv.
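Here is the tibble and the export as a sketch. The column names are my best approximation of the ones on screen, and the lowercase variable names are my own spelling. I also write the file to a temporary folder here, which is my choice for the sketch so that it doesn't touch your project folder; in the video the file simply goes into the working directory. You need the tibble and readr packages installed for this to run.

```r
library(tibble)   # tibble()
library(readr)    # write_csv()

# Recreate the simulated variables so this chunk runs on its own
set.seed(123)
age <- sample(18:85, 500, replace = TRUE)
before.after <- rnorm(500)
crp <- round(rchisq(500, df = 2), digits = 1)
group <- sample(c("Control", "Placebo"), 500, replace = TRUE)
sbp <- round(rnorm(500, mean = 120, sd = 20), digits = 0)
side.effects <- sample(c("No", "Yes"), 500, replace = TRUE, prob = c(0.8, 0.2))

# Column header on the left of each equals sign, the 500 values on the right
my.data <- tibble(Age = age,
                  Difference = before.after,
                  CRP = crp,
                  Group = group,
                  sBP = sbp,
                  SideEffects = side.effects)

# Export as a comma-separated values file (temporary folder is an assumption)
csv.file <- file.path(tempdir(), "data.csv")
write_csv(my.data, csv.file)
```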
Where on my hard drive is it going to go? Well, remember we said setwd and then we passed getwd as its argument, so it's going to put that in the same folder as this Rmd file. I set it up that way. If you want to put it in some specific place, remember that you can say C, colon, backslash, and you actually have to put two backslashes, but we won't talk about that. I hate doing that; it all just lives in the same folder. Let's talk about importing a file. With tibble we also get read_csv (like write_csv, it is provided by the tidyverse's readr package). In normal base R that would be read.csv; that's a different function, and it's going to import the file as a data frame. I don't like data frames, I like the more modern tibble, so I use the tibble library and I use read_csv, and I'm just referring straight to this ProjectData.csv file. If you downloaded this from my GitHub repository, that project data file is going to be there for you, and we're going to use it for the rest of this tutorial. It lives inside of the same folder, so once again, using my setwd and its argument getwd, it's all in the same folder and I can just bring it in; I don't have to type the address to this file in there. So I'm going to execute this, and there we go, I've now imported a spreadsheet file. So here we exported a spreadsheet file, and here we are importing a spreadsheet file. If you actually go look at your folder structure, this data.csv file with the simulated data is now going to appear there, and we can give it to each other. Now remember we imported the DT library. That is a very specific library: if you're going to export something to the web as an HTML file, it just formats your data very nicely, and I like to do that when I export to HTML. It has a function called datatable, and I'm going to pass data to it; remember, that's the spreadsheet file that we just imported as a tibble. I'm going to do that, and it's going to do this very nice formatting when I export it. It actually has a little search bar, and you
can go to the different pages until you get to the end. Remember, there are 500 in here, I think. Yeah, there we go. See there, that's the data that we imported. It has seven variables and 500 values for each, so that's slightly different from the simulated one; this is one that I had simulated and saved before. But it's a very nice way to look at it: you can search, and you can also sort in ascending or descending order there. So, very nice. Now that we've imported this actual data set, let's just have a quick look at the dplyr package. It is a bit difficult, I'm going to warn you right now. What it allows us to do is extract only the values that we are interested in from our data, and that's a very powerful thing, but it's not the easiest thing to get used to, and you're going to have to watch some tutorials specifically on it. Here I just want to introduce you to this concept of extracting data using the dplyr library. The first thing we want to do is create a new tibble, because remember we used read_csv, which imported the file as a tibble. So we want to create a new tibble, but we don't want everything from the old one; we only want to select certain things, and what we want to select is only the patients in group number one. Let's just have a quick look at the spreadsheet file that we imported. It has an age column, it has a difference column as well, it has a CRP column, it has a group column, and inside of this group column we see that we have only ones and twos. So patients were assigned either to group one or to group two, and what I want to do is extract only the patients that were in group one. So how would I go about that?
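The import step and the question just posed can be sketched as follows. This is a hedged illustration rather than the video's code: the file name, the tiny made-up data set standing in for the downloaded project file, and the column names Age and Group are assumptions, and the DT call is left as a comment because it only matters when knitting to HTML.

```r
library(readr)   # read_csv() and write_csv()
library(dplyr)   # filter() and the %>% pipe

# Stand-in for the downloaded project file: a tiny made-up data set
csv_path <- file.path(tempdir(), "ExampleData.csv")
write_csv(
  data.frame(Age = c(37, 62, 45, 51), Group = c(1, 2, 1, 2)),
  csv_path
)

# read_csv() returns a tibble; base R's read.csv() would return a data frame
data <- read_csv(csv_path)

# DT::datatable(data) would render a searchable, paged table in knitted HTML

# Function-call form: the tibble is the first argument, the condition the second
control.group <- filter(data, Group == 1)

# Pipe form: whatever is left of %>% becomes the first argument on the right
control.group.piped <- data %>% filter(Group == 1)
```

Both forms return the same rows; the pipe form only starts to pay off once several steps are strung together.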
First of all, I've got to give this new tibble a name, a computer variable name, and I'm going to call mine control.group, because imagine the group one patients were in the control group. Now you see something very funny going on here; I want you to ignore that. Let's just go to line 325, because this would be one way to write it. You see this is a comment, and if I uncomment it and comment out this one, that line won't be executed and the second line will be. So let's start with this one. I'm going to use the filter function, and it just filters row by row; that's what this thing does. The first argument is: what tibble are you referring to? Well, I'm referring to the data tibble. And then the second argument, let's just put that on its own line, says group equals equals one. You see the two equals signs; that's a Boolean comparison operator. It asks a question: does that row contain a one? And that will return either a true or a false. If it's true the row will be included, if it's false it will be excluded. Now this is not the normal way in which we write it; we use that little symbol there. Let me comment this line out as well; you've got to comment out line by line. So let's do this the proper way. We use this symbol, %>%; it's Ctrl+Shift+M, or Cmd+Shift+M on a Mac. It's the pipe operator: we create a pipeline of things that we want to execute. What the pipe does, this whole little thing, is say: take whatever is to the left of me and pass it as the first argument to whatever is to the right of me. And that's just what we had down here, where it was data, comma. So it just says take this data and pass it as the first argument. It doesn't make much sense now, but as you start using it you'll see it makes a lot of sense, especially if you start stringing a lot of these pipes together. So at the moment it just says: take my data tibble, go down the group column and find only
the ones. So if we run that, we're going to create a new tibble, there it is up there, and if we were to look at it we'd notice that it's just ones down the group column; there are going to be no twos there whatsoever. Here's another one that I did, younger patients. So I said younger.patients, go to the data tibble and pipe that to the filter function, and what I'm looking for in the age column is everyone younger than 50. So that's going to extract everyone that's 49 and below into a new tibble. We see the tibble there, and when you open it you'll see everyone is younger than 50. Now this one is slightly more complex. I'm asking here for the patients younger than 50, and they must be in group number one. So again a computer variable name, younger.patients.II, whatever. Take the data tibble, and I'm going to pass that on to the filter function, and I want two things, and what we do with two things is put this little ampersand in between them. Strictly speaking, and let me do that right now, I've got to have another set of parentheses there. So it says age less than 50 and group equals one; I want both of those in this filter function. Now it's only going to have the patients younger than 50 who are also in group one, and I've created a new tibble called younger.patients.II. As I say, you have to look into dplyr itself; leave some comments down below if you want videos just on dplyr. This is just a very brief introduction to it, but it's a very powerful way to tease your data apart and get only the values that you're interested in. Next up we're going to look at some more descriptive statistics, this time not on the simulated data that we created one list object at a time. This time we're going to extract some of this data from our tibble using dplyr, and we're going to do descriptive statistics on that. Okay, let's go about describing this table of ours. So the first question we might want to answer is: can
we get the mean age of the patients that belong to group one and the mean age of the patients who belong to group two? Now we could extract those as two different tibbles first, but we can just use the main tibble and do that. Let's see how to go about it. I'm going to call my tibble, that's data, and then I'm going to pipe it to the first function, which is group_by. Group by, that makes sense, and the column by which I want the grouping to happen is the group column; we remember there are patients in group one, the control, and in group two, the treatment arm, whatever the situation might have been. Then, once I've grouped them, I pipe that on, and you see why I said that once we start stringing these pipes together they actually make a lot of sense. I'm going to pipe that to the summarize function, and in the summarize I want to ask for the mean of the age column. What we do here is give that mean a little column name, so I'm just going to say mean.age. That's a name I decided on, and you're going to see why we give it that little descriptive name in just a moment. So once again: take that tibble, group by whatever you find in the sample space of the group column, and then summarize that for me by calculating the mean of the age column. Let's run this and see what it does. There we go, we get a tibble back. It has group here, it found group one and group two, and then it gives me back the mean ages of those two groups. And see mean.age there; we just had to give that column a name, and that's where this little name comes from. So, very easily done. In the next one, let's group by the side effects. Some people had side effects, some people didn't, and I want to know how many had side effects and how many did not. I just want to count the number in each of those unique sample space elements. So we start with the tibble, we're going to group by
the side effects, and we pipe that to the summarize function again, and this time we are going to use this n function: n, open and close parentheses, nothing else, and we're going to call that count. What this n is going to do is count how many rows fall under each unique value. So if we run that, we see count there, that's the name we created, and it just does the count for me with this n function: 289 nos, 211 yeses. So you can see that when I created this spreadsheet file of simulated data before, I gave a bit more weight to the no, but not the 80/20 that we did in the beginning; this was slightly more equal. What I want to know now is how many people in group one had side effects and no side effects, and then in group two how many had side effects and no side effects. Easy peasy. I'm going to say data, I'm going to pipe that to the group_by function, grouping according to the group, and I'm going to pipe all of that not to summarize. I could use summarize, but count will actually do the count for me, so I don't have to say summarize, count equals n; I can do this all in one go, much easier. I'm just going to say count and then pass side effects as an argument to count. So that's just an easier way to go about it, and now we can see group one, group one, group two, group two. Group one nos was 137, group one yeses 114, then the group two nos and the group two yeses. And that is what you need for a contingency table, your observed table for a chi-squared test for independence. It's easy to extract that information from our tibble; we have the values there, and we can do a chi-squared test as easily as that. So when you have a tibble, whether you've created your simulated data or imported it from a spreadsheet file, you can see at least the potential for how easy it is to draw out the values that you want and just describe them. I do mention, though, that using dplyr, as part of what is called the tidyverse, is a bit difficult in the beginning, but you'll soon find out that there's a
few functions that you can use in dplyr, you string them together as a pipeline, and in the end it just makes so much sense; it becomes intuitive to use. After you've described your data, what do you do? You visualize it, so let's have a look at visualizing the data. These packages work very well together, tibble and just the normal plotting inside of R. Let me show you. We're going to do a box plot, and I want a normal box plot of the age. Remember, before we just did the age, but this time I want it for the different groups, the different sample space elements in the group column; remember they were group one and group two patients. That we do by creating a little formula, and we write that formula with this little tilde sign. It says: I want a box plot of the age, but please separate it by the group. Now I've got to tell it what the table is that it's got to work from, and it'll work for old-fashioned data frames as well. So I'm going to say data equals data: the first is my keyword argument, data, and the second is my table name, data. It doesn't matter that they have the same name; R can figure that out for itself. Then color. I know there was group one and group two, I know there were two, so I can pass two different colors and they're going to be used in that order, and I'm going to use deepskyblue and orange. I'm going to have a title, and I'm going to call that age distribution by group, then an x-axis label and a y-axis label. And this time I'm going to use this new keyword argument, las, and I'm going to set it to one. I want you to guess what it's going to do. That's a bit unfair, so let me show you. There we go, beautiful: deepskyblue and orange, group one and two, age here. But look at the text on the y-axis; it's all standing upright now, and that's what las equals one does for us. It just turns those numbers so they look proper. Isn't that just one of the most beautiful graphs you've ever seen? I absolutely love it. I do like Plotly a bit more.
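The grouped summaries from a moment ago and the box plot just shown can be sketched together as follows. This is only an illustration under stated assumptions: the data are simulated on the spot, and the seed, sample size, column names (Age, Group, SideEffects) and axis labels are placeholders rather than the video's actual file.

```r
library(tibble)
library(dplyr)

set.seed(42)  # assumption: arbitrary seed for a reproducible sketch

# Made-up stand-in for the imported project data
data <- tibble(
  Age         = sample(30:80, 200, replace = TRUE),
  Group       = sample(c(1, 2), 200, replace = TRUE),
  SideEffects = sample(c("No", "Yes"), 200, replace = TRUE)
)

# Mean age per group; mean.age is the name given to the new summary column
mean.by.group <- data %>%
  group_by(Group) %>%
  summarise(mean.age = mean(Age))

# Rows per side-effect category within each group (the observed counts)
counts <- data %>%
  group_by(Group) %>%
  count(SideEffects)

# Formula interface: Age on the y-axis, one box per Group
boxplot(Age ~ Group,
        data = data,
        col  = c("deepskyblue", "orange"),
        main = "Age distribution by group",
        xlab = "Group",
        ylab = "Age",
        las  = 1)  # las = 1 draws the axis tick labels horizontally
```

The counts tibble has one row per group and side-effect combination, which is the shape needed later for a contingency table.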
But there are some videos you can watch that I've made on Plotly. Let's make a scatter plot. Remember, for a scatter plot the function is just plot, and this time we want SBP, systolic blood pressure, by age. Now which one goes where? Which one is the dependent variable, which goes on the y-axis, and which one is the independent variable, which goes on the x-axis? There's a little clue for you: the x-axis is the age and the systolic blood pressure is y. So it's always y-axis, then x-axis. It's saying I'm trying to predict this dependent variable by this independent variable, so it's always y tilde x, in that order. Everything else is the same, las is one, and once again we see that age is not really a predictor of systolic blood pressure there. But with this tibble it's very easy to do that. Let's do some inferential statistics. I'm going to show you the most common statistical tests, and we're going to start with Student's t-test. So there are my two pound signs, hashtags, so that's going to be a subtitle, an H2, and then a sub-subtitle with the three hashtags, so that's going to be slightly smaller. And then you see I write Student's and then underscore t underscore. An underscore before and after a word, or a set of words, or even a paragraph, means italics; I want that t to be in italics when it's printed out. If I put two of them on each side, that means bold. Two in front and two behind is bold, one in front and one behind is italics, and I want the t to be italicized. So let's do Student's t-test, very simple to do when you have a tibble. t.test is my function name, and I'm going to pass some arguments to it. Here's a list of all the arguments at their default values, so you can actually leave them out. Remember, keyword arguments have default values, so if you don't put them in, those values are used behind the scenes. But let's look at them so you can see that you can set them. I'm going to say I want you to compare, and Student's t-test, remember, that is a variable between two groups,
same variable in two groups, compare the means to each other. So it says: compare for me the systolic blood pressure between whatever you find in the group column. Remember, with Student's t-test you can only have two groups, and we know group only had group one and two, so this is going to work for us. The table, well, that's the data tibble. Alternative: what is your alternative hypothesis, is it a two-sided, two-tailed hypothesis or is it one-tailed? We want two-sided, and we pass that two.sided argument. mu is the difference in means that we expect under the null hypothesis that we create; we want to say there is no difference between the two, so we say mu equals zero. paired equals FALSE. We can also do a paired-sample t-test, and then we just say paired equals TRUE here, but let's for argument's sake imagine that there was no pairing here, so it's paired equals FALSE. We're also assuming equal variances here, so I'm going to say var.equal equals TRUE. So an equal-variance, unpaired t-test, that's the normal Student's t-test, and we're using a confidence level of 0.95, in other words an alpha level of 0.05. And look at this beautiful result that we get. It says two-sample t-test, the data is systolic blood pressure by group, we see our t-statistic at 1.4, we see degrees of freedom 498, and we see a p-value of 0.15, so it was not statistically significant. We can see the 95% confidence interval around the difference in means, and we can see the two sample means, 125 and 124 or thereabouts. So all the information you need, and really lovely when you start writing your articles for submission to a journal. Lovely stuff. Let's do a bit of linear regression. With linear regression we build a linear model, and the function for that is lm, linear model. Very simple: I'm going to say SBP tilde age, so again I'm trying to predict the systolic blood pressure given the age. That is the linear model that I'm trying to build. Data equals data, but I'm putting all of this as a single argument inside of the
summary function, and the summary is going to give me a nice little report as well. There we go. It shows the formula, as we're trying to predict systolic blood pressure given the age. We can see the residuals in our prediction, the descriptive statistics of the residuals, we can see the coefficients, and right here at the bottom we can see that our adjusted R-squared is almost 0. We could see that from the visualization; there was really no correlation between those two. So age is not a predictor of systolic blood pressure, and our adjusted R-squared is very low. We see our F-statistic there; remember, these things are all built on F-statistics and analysis of variance, really. And we see a p-value there of 0.12, so not significant at all. It's very easy to build linear models. If you want to add a second variable, it'll just be plus, and then for instance we add CRP. You can just continue adding pluses there and you'll build more independent variables into your linear model. Lastly, let's do a chi-squared test for independence. Now you can see something else here. It's a level 3 heading, with three pound or hashtag signs, and then we have dollar and dollar. That allows us to use something called TeX, or LaTeX, and that gives a mathematical typesetting of characters, very beautiful when you print this out to PDF or Word or even on the web. A backslash chi means a chi character, the lowercase Greek letter chi, and then the caret symbol and 2 means put what comes next in superscript. So it's going to be chi-squared written out very neatly. TeX, or LaTeX, is a language on its own; it's easy to learn the basics and incorporate them, but that's not what this tutorial is about. So what we need is the contingency table, remember, the observed table. How did we do that? I'm just going to redo it here: data, we pipe that to group_by, and group by group, and we pipe all of that to a count of the side effects. So let's get
those: 137, 114, 152, 97. Pay particular attention to the way this was done, 1, 1, 2, 2. You might sometimes see no, yes, yes, no, so when you do build your contingency table, make sure that the order of the values is always the same. The way that I like to do it is to create two lists. The first one is my group 1, that was 137 and 114, and group 2 was 152 and 97. And I'm going to say the number of rows is 2; there's going to be a group 1 row and a group 2 row. Then what I want to do is create a matrix, so for this we use the matrix function. I'm going to pass some arguments. The first one is a list object, c, and I'm going to pass the first row, group 1, then the second one, group 2. The number of rows, nrow, is my keyword argument, and I'm passing my number of rows to it, and that was 2, and I'm saying byrow equals TRUE. Now you don't have to do all of this; this is just some fancy eye candy that I'm building here. Once that is done I'm using the rownames function. I'm passing my matrix as an argument to it, and I'm actually treating the result as a computer variable that I can assign to. That's something very specific to R, so just go with the flow. I'm going to assign a list object to that with group 1 and group 2, and then I'm going to use the colnames function on the matrix that I've just created and assign no and yes to it. Finally I'm going to call this matrix that I created. Again, you don't have to do all of this. It just produces this nice little table, so that when I share this with my trainees who don't know a lot of statistics, it's a nice visual representation. What we actually need is only those numbers, 137, 114, 152, 97; that's all we need. But you can always just refer to this code and create your nice little table. As we can see there, we can all see what that contingency table is all about. And I can pass that whole thing, with the row names and the column names that I put in here, to the chisq.test function. I don't want any
Yates correction, so I'm going to say as my second argument there correct equals FALSE. Let's run that, and there we get a Pearson's chi-squared test. We see one degree of freedom, and we see our chi-squared value there of 2.142 and the p-value of 0.143, so really the group and the side effects were not dependent on each other. And that's it. I hope you like R, I hope you want to use R, and I hope you have a good understanding now of what it is all about. So let's save this document. Very lastly, I want to show you how I created that final document that we uploaded to RPubs. You see a tiny little button here, Knit. Let's click on Knit, and we're going to wait a few seconds. It's going to knit, from the knitr package; it's going to knit all our code together as a beautiful HTML file. Remember we set toc to true in the beginning, so we have our table of contents of all the headers that we used; the two and the three little hashtags built that all together very nicely. I can click any one of these and it goes down there. There's the logo that we put in. You see the colors, our navy blue there for the title, which was in the YAML right at the top, remember, and then our second-level headings were all colored this gold color. They're all there, and there's all our code and the beautiful execution of it. We see our plots there, absolutely fantastic. Because I've published it before it'll actually say Republish here, but if this is the first time and you haven't published it before, you can hit that Publish button. That's going to allow you to open a free account on RPubs and then open up a page where you can give a little description and a name to this file, and you will have your own RPubs website. You can also open this in the browser right here. And that is why I like not to do this as a script but to create these Rmd files. You see there's a tiny little down arrow there next to Knit, and remember I said we can overwrite what
it's exported to, because here we can do Knit to PDF and Knit to Word as well. Fantastic. I hope you enjoyed this tutorial. Let me know in the comments down below if you want to see more about the specifics. Before I recorded this video I actually put some Plotly videos out there and one or two other tutorials on R. Have a look at those; they will all be in the same playlist, and I hope over time to make more tutorials on the use of R for your statistical analysis.
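As a text recap of the inferential section, the three tests can be sketched in base R as follows. Only the four contingency-table counts come from the video; the simulated data frame, its seed, and the column names Age, Group and SBP are assumptions made so the sketch runs on its own. The paired argument is left at its default (unpaired) because some recent R versions reject it in the formula interface.

```r
set.seed(1)  # assumption: arbitrary seed

# Made-up stand-in data: 500 patients in two groups
data <- data.frame(
  Age   = sample(30:80, 500, replace = TRUE),
  Group = rep(c(1, 2), each = 250),
  SBP   = round(rnorm(500, mean = 125, sd = 15))
)

# Student's t-test: two-sided, equal variances, 95% confidence level
t.result <- t.test(SBP ~ Group, data = data,
                   alternative = "two.sided", mu = 0,
                   var.equal = TRUE, conf.level = 0.95)

# Linear model predicting SBP from Age, wrapped in summary() for the report
model.summary <- summary(lm(SBP ~ Age, data = data))

# Contingency table from the observed counts quoted in the video
group1 <- c(137, 114)  # group 1: No, Yes
group2 <- c(152, 97)   # group 2: No, Yes
observed <- matrix(c(group1, group2), nrow = 2, byrow = TRUE)
rownames(observed) <- c("Group 1", "Group 2")
colnames(observed) <- c("No", "Yes")

# Pearson's chi-squared test without Yates' continuity correction
chi.result <- chisq.test(observed, correct = FALSE)

print(t.result)
print(model.summary)
print(chi.result)
```

Adding a second predictor to the model would be a matter of extending the formula, for example SBP ~ Age + CRP, provided a CRP column exists in the data.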