use it. So this is an advantage of R; that is just a little review of why you need to learn it.
So, get R. For those who have not installed R, this is the link. You can go to the link and choose your operating system, whether you use Windows, Mac, or Linux. Does everyone have it installed? Okay.
You also need to get RStudio. So what's the difference between R and RStudio? You can think of R as something like a compiler: it works behind the scenes on your computer and changes whatever you write into machine language, so the computer will understand. RStudio is an editor interface. Another term you may have heard is integrated development environment, IDE. Okay, thank you. So RStudio will be the editor interface, and it is more user-friendly: it is the place where you type code, and it makes your code easy to visualize.
Some features of RStudio. There is syntax highlighting. For example, if you want to write a for loop, the "for" will be highlighted in a different color, so you will not have to squint to see it clearly. Some keywords are highlighted in different colors, so it is easier for you to refer to them. There is code auto-completion. That means RStudio is smart enough that, if you want to write a for loop, you just write the "for" and it will auto-complete the bracket for you, so you don't have to type the bracket yourself. It's very smart in that sense. There is smart indentation. For example, people write loops across several lines, so when you press Enter it automatically gives you some indentation. It's all for easy reference later. And you can also execute the code directly from the editor interface, which is RStudio. You just press Run and it runs line by line. You can go line by line, or you can select a block of code and run it together.
The data viewer: whenever you input some data into R, you can go to the data viewer and view the structure of the data. It is a place for you to see the data. And R help: whenever you do not know some function, how to write it or what arguments it needs, you can just write the function name with a question mark in front of it, and R will tell you what the arguments of this function are. So this is RStudio. Normally people just download R, but I recommend you download RStudio as well.
So that's the background. Now let's go to the real course, the R language. What I'm going to teach today is some basic syntax, meaning the grammar rules in R, and the data types. For whatever language you use, you have to make sure the data you bring into the software are supported, so I will introduce the data types R supports, and also the data structures. The data can be a column, the data can be a matrix, the data can be a data frame. I will highlight all of this later. Then arithmetic operators: you can use R as a calculator. Arithmetic is just plus, minus, multiplication, division, all these things, and there are rules for them in R. And the logical operators, for example TRUE or FALSE, equal or not. So I will teach them one by one.
So let's go to the basic syntax. Now you may turn on your RStudio, which looks like this. All on? You may not have this panel, you may not have everything I have here, but it should look like this. Okay, everyone on? Right. So I will introduce the regions. This region is called the console; you can see the word "Console" here. The console is the place where your code is executed, and this place, the empty place, will be the editor.
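The help lookup mentioned earlier, a question mark in front of a function name, can be tried from the console. This is only a minimal sketch: the function seq is used purely as an illustrative name, and the variable arg_names is a hypothetical name introduced for this example.

```r
# Open the help page for a function: a question mark in front of its name
?seq

# The same lookup written out in full
help("seq")

# List the argument names of the default seq method without opening help
# (arg_names is a hypothetical variable name used only in this sketch)
arg_names <- names(formals(seq.default))
arg_names
```

The help page lists each argument and what it expects, which is usually the quickest way to check how a function should be called.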
So you write your R code here, and this region is the global environment. Normally, if you store variables or input data, it gives an overview of all your data, which is the data view I just mentioned. This place has multiple uses: it can show you whatever packages you have already installed, or the plots. So far we have not plotted anything, so it is empty, okay.
So let's do some basic code. Go to your editor and press this button, the plus. You can see that at the top left corner there is "R Script"; press it and it will generate a place for you to write. Can you see, it will be "Untitled". No, no, this is R; you need to open RStudio. Have you installed RStudio? R itself has an editor interface, but it's not user-friendly. This is R; you need to install RStudio, this one. Yeah, open it. One is R and one is RStudio: this is the R icon, and this is the RStudio icon. Xiao Dong is my friend, and he's also a pro in R, so he will go around and see if everyone has it. Everyone, make sure you have this interface. Yeah, correct, I just use a different background.
If you want to change the background, the black one looks nicer, right? In RStudio you go to Tools, then Global Options, then Appearance. Mine is black, so it's easier to see; the white one is very hard to see, right? So go to the top menu, Tools, Global Options, then Appearance, and there you can change the color. You can click any of them, then choose the one you are comfortable with. Okay, everyone has this interface already, right? Yes,
this is the correct one. Okay, let's do our basic stuff. We do a very basic thing: you can use R as a calculator. For example, type two times, bracket, four plus two. So what would be the answer? You have already written this line of command, so how do you run it, how do you tell R to run it? You can just select this line, then press Run here. Another way of running it: select it, and if you are using a Mac, press Command plus Return; it does the same thing if you do not want to press the Run button. If you're using Windows, you press Control plus Enter, if I'm not wrong. Okay, so you can see what it returns here. You can see a [1] here; don't bother about what this [1] means for now, it just marks the first item of the output. So 12 is the answer. This is very basic: you can use R as a calculator.
All right, then variable names. This is very important in R, because if you just want to calculate basic arithmetic you can use a calculator; you do not need to turn on R. What R can do is assign values to a variable. So, first things first, what is a variable? If you have previously encountered any programming language, Python or C, you will know this; it is about the same. A variable is just a name that you assign a value to. The name has some rules. The name must not contain symbols other than underscore and dot; that is the first thing when you design a variable name. It must not begin with a number. And it's case sensitive. For example, if you want to store some value in a variable called value, value is different from Value: you see the V is capitalized, so it's case sensitive. And to assign a value in the R language, you can assign values with an equals sign, or you
can assign values with an arrow, where the arrow points to the variable name. For example, here you want to assign 23 to the small value. The next time you want to call this 23, you do not have to write 23; you write value, and value will return 23. So that means you assign a value to a name. Is it easy to understand? Then another thing is that you assign 45 to your Value, the big Value, so the next time you write the big Value it will return 45.
And ls(), what is this? It is list: this function will list whatever variables you have stored in your computer. Later I will show you. And rm, what's rm? rm comes from remove. When you do not want a variable any more, you just remove it; it clears it from RAM, and this value is no longer stored in your computer. So let's try it.
The [1], you mean this one I just mentioned? It's just the first position. Any value stored in R is normally stored in a vector, so this marks the first element in the vector. But you don't have to bother about it; it is just for R to keep track of the location. Okay, you just look at the result here.
So let's try to assign a value. You can write value with an arrow pointing towards value, and give it 23. On another line you can write the big Value and give it 45 with an equals sign. It is the same thing: the equals sign and the arrow pointing towards the variable do the same thing. My preference is to use the equals sign, because you have to press two keys for the arrow, right? So let's write both. Okay, you have already written these two lines. This one cannot change the color? There's a little problem with the screen. Oh, okay, it's very dark, right? I'll change it to white; when I do programming I prefer black. Okay. Sorry, what? They are doing the same thing; I just want to demonstrate that the arrow and the equals sign do the same thing, so you can choose either one.
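The calculator line and the two ways of assigning can be sketched in a few lines. This is only an illustration under assumptions: the calculator expression is taken to be 2 * (4 + 2), which matches the answer 12 given in the session, and scratch is a hypothetical throwaway variable that does not appear in the session.

```r
# Use R as a calculator: the console prints [1] 12
2 * (4 + 2)

# Two ways to assign; both store a value under a name
value <- 23   # arrow pointing at the variable name
Value = 45    # equals sign; the capital V makes this a different variable

value         # prints [1] 23
Value         # prints [1] 45

# List what is currently stored, then remove one variable
scratch <- 0          # hypothetical throwaway variable for this sketch
ls()
rm(scratch)
exists("scratch")     # FALSE once it has been removed
```

Each line can be run on its own with the Run button, or with Command plus Return on a Mac and Control plus Enter on Windows.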
Whichever one you are comfortable with; I am comfortable with the equals sign. I also want to demonstrate that it is case sensitive: big V and small v represent different variables, so next time you give a name you may want to pay attention. You already stored 23 in the small value, right? Then if you call it with a big Value instead, it will return an error message, because you have not stored any data in the big Value. Okay, you can see clearly now, right?
So you have already put these values in these variables; how do you make the computer store them? You have to run it. You have only written it, you have not run it. So how do you run it? As I said, you select it and press Run. Ta-da, you see it here in the global environment. The global environment is the environment that stores whatever you have already assigned a value to, so it appears here. You can run it that way, and you can also remove it. After removing it, I just want to demonstrate this part with the equals sign. Another way of running, apart from pressing Run: Mac users can press Command plus Return; it does the same thing.
All right, what other things can you do? You can list: ls(). It will list whatever variables are in this computer, so it will list value and Value. These are the variables currently stored in your computer's RAM. You can also remove them. Typing the remove command is troublesome, so you can just press this button; that's why I say it is user-friendly, you can press a button rather than write code. Okay, basic syntax, everyone clear? So this is how you write code in the editor.
Now I will introduce data types. What's a data type? Whatever data you receive will basically have one of three types. It can be numeric data. What's numeric data? It's just numbers. R does not really separate double from
integer. If you have learned C before, C has double and int, different types of data. What's the difference? Integer means just 1, 2, 3, 4, 5, all these numbers. Double means floating-point numbers, for example 4.5, that is, with a decimal point; you can call it a float or a double. In R, it doesn't really matter whether it's an integer or a double: as long as it is a number, it is called numeric.
Another one is character. R can also accept characters. What does character mean? Whatever is surrounded by quotation marks is a character. For example, my name is just a string, so it's a character. So R can store not only numeric data, it can also store characters.
It can also store another data type, called logical: TRUE or FALSE. It's very important in a programming language to know this type of data, which is logical, or zero or one. For example, you can visualize a spreadsheet: the first column is the names of the students, the second column is "have you attended school today", and the teacher will mark a tick or a cross. You can just replace the tick with a one and the cross with a zero. So these are logical: yes or no, only two options.
All right, so these are the three data types. It's very important to distinguish these three data types, because later you have to know which functions to use for each.
Another one is the missing value. How does R deal with missing values? For example, first column student name, second column "have you attended school today": some students yes, some students no, some students no value. Why no value? MC, right? The student does not attend school, but it's not considered absenteeism because he has an MC. So what do you put? Not yes, not no; you just put a missing value: no record. So
in R it is represented by a symbol called NA. Okay, without quotation marks, just NA: not available. And operations on NA result in NA. That means if you multiply something by NA, it will produce NA. Of course: nothing is there, and anything times nothing is still nothing. And in R there is a function called is.na() to check whether there is any missing value inside the data.
So for example, go here and assign NA to x. The next time you call this x, it will return NA. Okay, if you didn't get it just now, couldn't follow: I just want to demonstrate that NA is a very special symbol in the R language. Whenever you do not have a value, you can just use NA to store it, or if there are some missing values in a vector, they will be represented by NA. For example, next time you have a spreadsheet and some cells are not filled, there is nothing there, then when R reads it in, it represents them by NA. That is all I want to say here.
Okay, you still remember your value from just now? You can type value and press Enter, and what will be returned? The number 23, because the number 23 is stored in value, and value is called a variable. Next time you call value it will return 23, so they are equal, they are interchangeable in that sense. All right, then you also have your x, right? x is a variable, remember, it's a variable name, and I gave it NA. So when you want to check whether x is NA, you write is.na(x), and it will return TRUE. That is the logical check. You check whatever variable, your x or any vector, for NA inside it, and if there is NA inside it returns yes; yes in computer language means TRUE. TRUE or FALSE, okay. So these are the three main data types, and one more, which is the missing
data type. Okay, all right, can you get it? Right.
Now let's go to data structures. This part is a little bit hard to understand if you do not have a programming background, so please follow me closely, and if you do not understand, ask me on the spot. So what are the basic data structures in R? In R, data are stored in different ways. You can just think of data structures as different ways of representing data. For example, the data can be stored in a vector. If you are familiar with other languages, for example in C people store data in arrays, you can think of a vector in R as roughly equivalent to an array in C. A vector is just a line of data, or a column of data. For example, you want to know the scores of the students in their test: 100 points, 95 points, 92 points, and so on. If you represent it in Excel, it's just one column, right? So a vector is what you get if you extract this whole column as a data structure, with all the data separated by commas. That is what a vector is.
There is also the matrix. You can think of a matrix as just a table: you have horizontal lines, you have vertical lines, so it's a spreadsheet. List and array are a little bit complicated. Most of the time people use vectors and data frames; these two are the most commonly used, so I will just emphasize vector and data frame. If you want to know the differences between matrix, list, and array, you can check after the class.
So I will introduce the vector in detail. This is the most commonly used data structure. How do you create a vector in R? In R there is a function called concatenate. The concatenate function is just c with brackets, c(). So let's go to your interface. For example, you
want to store a variable; give it a name, for example x1. You want to store a vector in this x1, so you type equals, then c, the concatenate. You want to put 1, 2, 3 in this vector, so you just type 1, 2, 3 separated by commas: x1 = c(1, 2, 3). Then you run this line. You can see in your global environment how x1 is represented. What's this "num"? num shows that it is numeric, so this data type is numeric. The [1:3] means that in this vector, from position 1 to position 3, there are three elements, and 1 2 3 are the three elements stored in this variable. You need to understand what this display means. So next time you type x1, what will be returned? Can anyone tell me? 1 2 3. That means you stored this vector under the name x1. Everyone can follow, right?
So this is numeric data. What other types of vector can you produce? The length now is 3, but you can extend it; the length can be whatever you want. I just did a demo with length 3; you can continue writing and add whatever you want.
Next, I want to create a character vector. How? x2, I give another name. Remember, for characters you need to surround them with either single or double quotation marks. For example, colors: I write "red", then "yellow", then "blue". Press Enter. You see in your global environment there is a "chr"; that means character. So R is smart enough to pick up that this is a character vector: it stores strings rather than numbers. [1:3] means position 1 to position 3, and they store "red", "yellow", and "blue". When you type x2, the three elements are shown on your screen. Can you follow? If I'm too fast I will slow down. Okay, so next you can
write another thing. We already have numeric and character; the next one is logical. Logical is represented by TRUE or FALSE, correct? So you write x3 = c(...), and inside you write TRUE. In R you write TRUE and FALSE in capitals, and they will be highlighted in blue. That's why I say RStudio has syntax highlighting: a special keyword is shown in a different color, so it is easier for you to refer to. So TRUE, FALSE, TRUE; I just give a test, then run. You see in your global environment that x3 is now "logi", logical, and [1:3] means position 1 to position 3, holding TRUE, FALSE, TRUE. When you type x3 it returns this logical vector. So this is how you input data in R.
All of this is very basic. Most of the time you won't type in the data yourself; most of the time you will be given a data set, and how you deal with that I will teach later. This is just the basics of how you do it in the R language. Once you're familiar with this, I will teach you how to import a data set. No worries.
Another thing is that R forces a single vector to store a single data type. For example, you see here we put 1, which is numeric; "a", which is in inverted commas and so is a character; and TRUE. So there are three data types, right? R will force them into the same data type. Can anyone tell me which data type R chooses? Yes, character. R will take them all as characters. When you typed 1 it was originally numeric, but when R returns it, it puts inverted commas around the 1, so this is no longer the numeric 1, it is the character "1". Everyone knows the difference between character and numeric, right? Character means string, numeric means number. So TRUE will no longer be
logical; it is no longer a logical element. You see, here it is no longer blue, right? It becomes green, with inverted commas. So TRUE in inverted commas actually means a string. R will force different data types into the same data type. So this reminds you that next time you use a vector, you should make sure you put the same data type in the same vector; if not, R will force it for you.
The next one you see here is another example: when you type 1 and TRUE, what will R return? Numeric, yes. R decides to use numeric instead of logical, because TRUE means one, you remember, and FALSE means zero. It just changes TRUE to 1, so it becomes a vector with 1 1. This is just a little of what R does for you, but normally you do not need to bother about this part; most of the time you will make sure there is one data type in one vector. Can you follow so far? Am I too fast?
Okay, so some important functions for generating vectors. rep, r-e-p with brackets: rep means replicate. Go to your screen, go to the interface, and try to write this command line: rep(c("a", "b"), times = 3). Can anyone tell me what this line means? c means you want to create a vector, right? "a" and "b", are they character or numeric? They are character, because they are in double inverted commas. And times equals 3, what does that mean? It means you want to replicate a, b three times. You do not really have to write "times"; you can just write 3 and it will do the same thing. For example, I can write x4 = rep(c("a", "b"), 3). You can write times = 3, or you can leave out "times"; R is smart enough to know that the next argument in this function is how many times
you want to repeat. So it will repeat a b, a b, a b, three times. Another thing you can do, for example, is rep with TRUE, FALSE, two times. Okay, this is how you write it.
There is also an easier way, a cheat way. Can anyone tell me? I'll just type it. You still remember how R displays a vector, position 1 to position 3? Next time, instead of using c(), the concatenate function, you can also use this shortcut: 1:2, which means number 1 to number 2. So if you want to generate a sequence from 1 to 100, how would you write it? Say s equals 1 to 100: do you want to type 1, 2, 3, 4, all the way to 100? No, right? It's very silly. The shortcut is to write 1 colon 100, s = 1:100. It will generate the sequence for you, and next time you call s it will return the sequence. So the colon is a commonly used shortcut for c() to generate a sequence.
So if you want 1 to 10, two times, how do you write it? Try: 1 to 10, two times. You can write times = 2, but R is smart enough to know that the second argument is the number of times you want to replicate, so you can save time by not writing "times =". Oops, sorry, it's rep, not seq; I did not write rep. Replicate, not sequence. All right, so it returns 1 to 10 two times. And here you can see rep(0, 5): that means it will replicate zero five times, so it gives you 0 0 0 0 0. But this command is not
stored in any variable, right? So next time you want this sequence, you have to write this line again. If you want to store it in a variable, it's very easy. For example v; you can give it any name you want. Press Enter, and next time you type v it will return this. So whenever you generate a sequence, it's best to store it in a variable, so that next time you call the variable name it gives you the sequence you just generated. All right, easy to follow?
Now that we have introduced this rep function, replicate, which is very useful, we move to the sequence. Just now I gave you a shortcut, right, 3:6; seq does the same thing here. You use seq; seq stands for sequence. So what does it do? seq from 3 to 6 by 1: what will be generated? 3 4 5 6. "by 1" means the increment. What if you have by 0.5? It will be 3, 3.5, 4, 4.5, 5, 5.5, 6. Here is another example, with an increment of 0.2. You can also leave out the "from", "to", and "by", because R is smart enough to save you the trouble. So let's do it: seq(1, 5, 0.5). It knows the first argument is the starting number, the second is the end number, and the third is the increment. If you do not have the 0.5, that's fine as well; it will generate 1 to 5. But if you have the third argument, the increment is 0.5. So this is how you generate a sequence. You can store it in any variable you want, for example b, and next time you call b it will return the sequence you just generated.
You can explore the rest, for example length.out. What's length.out? length.out = 3 means the sequence stops once it has reached length three. But this is less useful. You just need to know how to generate a sequence using seq, and
how to replicate a sequence using rep. These two are the ways to generate sequences. Then there is the shortcut I told you about just now: you can just use a colon. But this way you cannot specify the increment; the increment is 1 by default. If you want to specify the increment, you use seq.
All right, then another one you see here. What does this line mean? You see the c here, so this is trying to concatenate. You remember what concatenate means: you add something behind. You can just regard the concatenate function as a bowl, and you throw ping-pong balls inside. You throw in one ping-pong ball, which is this 1:3, the sequence 1, 2, 3. Then it is separated by a comma, and you throw another ping-pong ball, 9, into it. Then you throw another sequence into it. So there are three ping-pong balls, and each item inside is called an element. It returns 1, 2, 3, which is defined by this part; 9 is defined by this part; and then seq from 2 to 3 by 0.2, this part. So to concatenate means joining them, and then you have this new vector. This is how you manually input data into R and store it in a variable. This is the most commonly used function, so you have to be familiar with it.
All right, now you already have your data stored in your vector; how can you extract an individual element? For example this vector b: you want to extract a value, say this 2.5, the fourth value in this vector. How do you extract it? Very easy: you type b and use square brackets. Square brackets are the way you extract values from your vector. Which one do you want to extract? The fourth one, so you type 4: b[4]. Can anyone guess what this command will return? Yes, 2.5, because it returns the fourth element in this vector. So after you
store values in a vector, this is how you extract values, or elements, inside the vector by position. In R the position starts from 1, unlike C. In C or Java it starts from 0, which means that in C, when you type 1, you get the second element. But in R it starts from the first element, so it's very intuitive.
Then another function is called the length function. The length function returns the length of the vector, that is, how many elements are inside. For example, type length(b). Can anyone guess the length? 9. That means the vector b has 9 elements inside: from 1 to 5 in steps of 0.5 there are nine of them, 1, 2, 3, 4, 5, 6, 7, 8, 9. The length function is very useful later when you want to write a loop. Is anyone familiar with a for loop or a while loop? In the loop you have to specify the start number, the end number, and the increment, right? The end number you normally do not want to count yourself, so you can just use the length function to get the length. So the length function is very useful in this sense. Can everyone follow so far?
Okay, now we can do arithmetic. You already have your vector; what can you do with it? You can play with the numbers by doing basic arithmetic. For example, you have your b like this, with nine values, right? And you have your a; for example, I give a a seq, so a will be 1 to 9, and b will be 1 to 5 in steps of 0.5. They are of equal length, right? So you can perform arithmetic. If you type a + b, can anyone guess what it will be? It performs the plus element-wise. The first element in a is 1, the first element in b is 1, so it gives you 2. The second one in a is 2, the second one in b is 1.5, so it gives you 3.5. So whenever you do arithmetic on your
vectors, R will perform the arithmetic element-wise, on corresponding elements. So if a and b are not of the same length, what happens? Now I give a a length of 8; then what will a + b be? It will give you a warning, because they are of different lengths. R still returns something by reusing elements of the shorter vector, but it is not what you want. If you want to perform arithmetic element-wise, you have to make sure every element has its pair. If they are of different lengths, the system will give you a warning. See, it says the lengths are different. Then if you want to do multiplication, I have to make sure a is the same length again, so I give a another value: a is this, b is this. If you type a times b, it gives you an element-wise multiplication.
All right, what other things can you do? You can also use powers, "to the power of". For example, a to the power of 3: when you do a power, you use this sign, the caret. Later I will give you a list of how to perform each one: times is a star, plus is plus, minus is minus, division is a slash. So we can do a divided by b. Whatever new things you store will be shown here, but so far we have no new variables. Whatever variables you have created are stored here; this region shows every variable you have just created. So far so good?
Okay, other functions. Apart from the basic arithmetic, plus, minus, multiplication, division, you can also perform sine, cosine, tangent, all this stuff, so it's just like a calculator. You can also use log and exponential; exponential is here. These are built-in functions in R, and you can just call them. Normally when you call a function, you use the function name and then a bracket. But a name followed by a square bracket, what does that mean? What is the difference between a bracket and a square bracket? A function name plus a bracket is a function call, and you need an argument inside this function.
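The element-wise behaviour described above can be tried with a short sketch. It assumes a and b are built with seq as in the session (a from 1 to 9, b from 1 to 5 in steps of 0.5); the names a_short and msg are hypothetical, introduced only for this illustration.

```r
# Two vectors of equal length (9 elements each), as in the session
a <- seq(1, 9)             # 1 2 3 4 5 6 7 8 9
b <- seq(1, 5, by = 0.5)   # 1.0 1.5 2.0 ... 5.0

length(a) == length(b)     # TRUE: every element has a pair

# Arithmetic is applied element by element
a + b      # first element 1 + 1 = 2, second 2 + 1.5 = 3.5
a * b
a / b
a^3        # each element raised to the power of 3

# Built-in functions also work on every element
log(a)
exp(b)
sin(a)

# Vectors of different lengths: R reuses elements of the shorter vector
# and issues a warning. The warning text is captured here so it can be
# inspected (a_short and msg are hypothetical names).
a_short <- seq(1, 8)
msg <- tryCatch(a_short + b, warning = function(w) conditionMessage(w))
msg
```

Checking length(a) == length(b) before doing arithmetic is a simple way to avoid the warning case altogether.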
An argument is the input: you put in the input, and then the function is applied to it. If you use square brackets, for example a[2], what does it mean? It means the second element of the vector a. So make sure you know the difference between square brackets and round brackets; they do different things. a[2] is the second element of a, which is 2.

Now, is everyone familiar with vectors and how to deal with them? Okay. Matrix, list and array are a little more complicated and not commonly used in real life, so I will skip them and jump to the data frame, which is the most commonly used. If you are interested in matrices, lists and arrays, you can look them up yourself after the course. Vector and data frame are the most important structures in R.

A data frame is a list of vectors. Just now we had a vector; I said it could be the exam scores of all the people taking a test, for example 100 points, 95 points, 98 points. So what is a data frame? You can regard it as a collection of vectors. A vector is one-dimensional; it only has a length. A data frame is two-dimensional: it has rows as well as columns. In that sense a data frame can store multiple vectors. If you want to visualize it, it is just like a spreadsheet in Excel. So: vector, one-dimensional; data frame, two-dimensional.

A data frame is a list of vectors, so you can have several vectors, but make sure they are of the same length. If they are of different lengths, the data frame cannot be built. Each vector is treated as a column; columns are the default treatment in R. A data frame also allows different data types. We mentioned the data types: you can have numeric, character and logical.
You can see here that with data.frame you store answer, with values such as a, b, c, d. Can anyone tell me what data type answer holds? Sorry? String, which is character, because the values are in inverted commas. So answer is a character vector. correct is a logical vector, which contains TRUE or FALSE. mark is a numeric vector, so it stores values. They are all of the same length, which is five. Once you have stored them, you can call this data frame. I cannot copy and paste the code from the slide into RStudio because it is a figure, but it doesn't matter. So this is how you define a data frame. The data frame is returned in table form, and, as you remember, each vector is treated as a column: answer is a column here, correct is a column here, mark is a column here, each listed correspondingly. The 1, 2, 3, 4, 5 on the left is just the row number. So far so good? If you are familiar with vectors, a data frame is just a collection of vectors: one-dimensional versus two-dimensional.

Now that I have introduced data types and basic data structures, I will talk about arithmetic operators, which is only one slide. You can take note of how to perform addition, subtraction, multiplication, division, exponentiation, integer division and modulo. Modulo means you take the remainder, as you all know, but the last two are less used; you mainly need to know the first five.

Logical operators mean yes or no. Start with the equal sign. In any language you should know that a single equal sign and a double equal sign mean different things. A single equal sign means assignment of a value; in R you can also use the arrow.
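A data frame like the one on the slide, and the arithmetic operators just listed, could be sketched as below. The column names answer, correct and mark come from the lecture; the values and the name quiz are made up, since the slide itself is not reproduced in the transcript.

```r
# Hypothetical values; the lecture only names the three columns
quiz <- data.frame(
  answer  = c("a", "b", "c", "d", "a"),          # character vector
  correct = c(TRUE, FALSE, TRUE, TRUE, FALSE),   # logical vector
  mark    = c(10, 0, 10, 10, 0)                  # numeric vector
)
quiz        # printed as a table, one column per vector
str(quiz)   # shows the data type of each column

# Arithmetic operators
7 + 2     # addition
7 - 2     # subtraction
7 * 2     # multiplication
7 / 2     # division: 3.5
7^2       # exponentiation: 49
7 %/% 2   # integer division: 3
7 %% 2    # modulo (remainder): 1
```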
With a single equal sign, for example x = 3, you assign the value 3 to the variable x. But if you write x == 3, with a double equal sign, what does this mean? It means you want to perform a logical operation: you want to check whether x really equals 3. Just now you stored 3 in x, so this command is essentially asking whether 3 equals 3. Is it true? Yes, so it returns TRUE. Now suppose you assign x = 4 and then type x == 3. What do you get? FALSE, because x is now 4, so this command is essentially asking whether 4 equals 3. Of course not, so it returns FALSE. This is called a logical operation. One equal sign assigns; two equal signs check whether two things are equal.

If you want to check whether something is not equal to something, you use not-equal, which is an exclamation mark followed by an equal sign. For example, x != 3: is it true or false? TRUE, because x is 4, so this expression asks whether 4 is not equal to 3, which is true. This is very important in any programming language, so you have to familiarize yourself with it.

There are also the less-than and greater-than signs, for checking whether something is smaller than, greater than, smaller than or equal to, or greater than or equal to something else. For example, what is x now? x is 4. You want to check whether x is greater than 3 or not. What is the result? TRUE, because this expression says 4 > 3, which is true. Likewise you can perform the other comparisons.

Then there is the exclamation mark on its own, which means logical negation. The exclamation mark is negation in R.
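The comparisons walked through above can be tried directly in the console. This is a small sketch using the same values of x as the lecture; it uses the arrow for assignment, which behaves the same as the single equal sign here.

```r
x <- 3      # assignment with the arrow; x = 3 also works
x == 3      # TRUE

x <- 4
x == 3      # FALSE
x != 3      # TRUE
x > 3       # TRUE
x >= 4      # TRUE
x < 4       # FALSE
!(x == 3)   # negation of FALSE: TRUE
```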
I think negation in C is also the exclamation mark. Do you still remember is.na? What does this function do for you? It checks whether something is NA or not, that is, whether a value is missing. For example, I have my vector a. If I call is.na on my vector a, I get FALSE, FALSE, FALSE, all FALSE, because every element of a has a value; none is NA. Now suppose I define my own a: first value 1, second value 2, third value NA, fourth value 4. If I print a you can see it. Then if I call is.na, it returns TRUE for the third element, so the third element is NA. The is.na function checks whether your vector has any missing values inside. This is very useful when you have a very big spreadsheet that your boss gives you and you want to see whether there are any missing values in it. You extract a vector, apply this function, and it tells you which elements are NA. Without you eyeballing the missing values, it tells you where they are. So this is a very useful function.

To answer the question of what it checks: if you put a single variable inside, is.na checks whether that variable holds no value. If you put a vector inside, it checks which elements of the vector are missing.

The 5? Oh, you mean this one, numeric. numeric is a special function that just returns zeros: numeric(5) returns a vector of length five, all zeros. It is useful later when you write loops, for loops or while loops. Initially you want a vector that contains no real values, all zeros, and then each time the loop runs you update the values. This is how you do it. But most of the time I prefer NA rather than a string of zeros.
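As a rough sketch of the two ideas above, the block below checks a vector for missing values and then pre-allocates a result vector for a loop. The vector a uses the values given in the lecture; the name squares and the loop body are made up for illustration.

```r
a <- c(1, 2, NA, 4)
is.na(a)         # FALSE FALSE  TRUE FALSE

numeric(5)       # 0 0 0 0 0

# Pre-allocate a result vector with NA, then fill it inside a loop
squares <- rep(NA, 5)    # hypothetical name
for (i in 1:5) {
  squares[i] <- i^2
}
squares          # 1 4 9 16 25
```

Starting from NA rather than zeros makes it easier to spot positions the loop never filled in.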
For example, here you already know what a is. I defined my own vector a: the first value is 1, the second value is 2, in the third position I do not store anything, so it is NA, a missing value, and the fourth value is 4. That is my initial vector. When I run the is.na check on it, it checks element-wise whether each position holds a missing value. First position: 1 is there, so it is not a missing value. Is it not available? No, it is available, so it returns FALSE. The second is another FALSE. At the third position it finds an NA, so it returns TRUE. If you put an exclamation mark in front, it reverses everything: TRUE becomes FALSE and FALSE becomes TRUE. This is how you use the exclamation mark. So far so good?

We have just learned how to write basic vectors and the basic syntax. Now, what can you do when you are given a data set? Let's go to data handling. In data handling, what I will teach you is input and output: when you are given a data set, how do you get it into R? Just now we typed everything; we entered the vectors manually. But most of the time in real life you are not going to do that. You are given a data set, so you need to know how to import the data into R and let R do the work. That is input. Output: after you have performed your arithmetic and built your models, how do you export your data into a format such as TXT, CSV or Excel, so that you can open it in another program?

Next I will also introduce some built-in functions. R has many built-in functions that you can use readily without writing the function yourself. But sometimes you have to design your own function.
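The negation of is.na described above can be sketched as follows, using the same four-element vector. The last two lines go slightly beyond what the lecture shows and are included only as common follow-up uses.

```r
a <- c(1, 2, NA, 4)

is.na(a)       # FALSE FALSE  TRUE FALSE
!is.na(a)      # TRUE  TRUE FALSE  TRUE

# Not shown in the lecture: two common uses of these checks
which(is.na(a))   # position of the missing value: 3
a[!is.na(a)]      # keep only the non-missing values: 1 2 4
```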
So I will teach you how to make a function in R, how to design your own function to suit your own needs. There are specific rules.

Let's go to input and output. It is a little bit messy, but go to your R. I gave you an R code file in your Google Drive. Has anyone found it? It should be in your Google Drive. Download it and open it with RStudio. All got it? It is rcode.R; R files are stored in .R format. If you just double-click the .R file, by default it is opened by R, but you want to open it with RStudio. Everyone got it? Okay, I see everyone has it.

If you are using Windows, this is how you set your working directory. Every time you write code you want to save it somewhere, and whenever you have input data you want to keep it somewhere, so you have to specify your working directory in R. On Windows you use setwd, set working directory. Most of the time the path starts with C:, then the users folder, then your username. I don't know what name you gave your computer; it will be that name. Where the dots are, you can just put Desktop. You change this username to your own. On a Mac you do not need the C: part; you can write it this way, and again change the username.

After that comes input.csv. This input.csv is a dummy data file I created for you to practise with. Download this CSV file and put it into the directory you have just assigned, so the Desktop if that is the directory you set. Clear? For example, my Mac user name is Joshua, so this is how I go to my Desktop. Now my working directory is set to the Desktop.
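Setting the working directory might look like the sketch below. The two commented paths are hypothetical examples in the style the lecture describes and must be adapted to your own user name; the runnable part uses a temporary folder as a stand-in so that it works on any machine.

```r
getwd()                      # show the current working directory

# Hypothetical paths in the style described in the lecture (not run here):
# setwd("C:/Users/yourname/Desktop")   # Windows
# setwd("/Users/yourname/Desktop")     # Mac

# Runnable stand-in: switch to a temporary folder and back again
old_dir <- setwd(tempdir())  # setwd() returns the previous directory
getwd()                      # now the temporary folder
setwd(old_dir)               # restore the original working directory
```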
So what I do is put this data file on my Desktop. Then when I run this command line, it reads in my data. This is how you read data; the command is called read.csv. R can also read Excel files, with an add-on package, and it can read TXT, but most of the time, when you want a clean way to handle data in R, you use CSV format. Does everyone know what CSV means? CSV means comma-separated values. CSV is the most standard input for R. So next time you are given a data sheet in Excel, export it into CSV format first and then read it into R.

Then you have this line of code with read.csv, the R function to read CSV. Inside the inverted commas you put your data file name. Next time, if your boss gives you an assignment and the data file is called boss assignment, it will be the boss assignment .csv file. This part is changeable by the user.

header: what does header mean? In the data file, the first row holds the column names. That is called the header. When you are given a spreadsheet like this, most of the time the first row tells you what each column means. Here header is set to TRUE, which tells the software that, yes, there is a header.

sep means separator. Here the values are separated by a comma, because it is CSV. But what if you are given a TXT file? Most of the time TXT is separated by tabs: between one column and the next there is no comma, just a tab, which looks like empty space. In that case you change sep to a tab. But for a more user-friendly experience, convert to CSV. That is my preference and I hope it will be everyone's preference. So this is how you read data. You can copy and paste this line; it is universal.
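The read.csv call described above can be sketched like this. So that the example runs anywhere, it first writes a tiny stand-in for input.csv into a temporary folder; the column names and values are made up and are not the lecture's actual file.

```r
# Create a tiny stand-in for input.csv; names and values are made up
csv_path <- file.path(tempdir(), "input.csv")
writeLines(c("ID,Age,Gender",
             "1,34,1",
             "2,61,2",
             "3,18,1"), csv_path)

# header = TRUE: the first row holds the column names
# sep = ",":     values are separated by commas
data <- read.csv(csv_path, header = TRUE, sep = ",")
data

# For a tab-separated text file the separator would be "\t" instead, e.g.
# data_txt <- read.table("input.txt", header = TRUE, sep = "\t")
```

With the working directory set as in the lecture, the first argument would simply be the file name, "input.csv".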
Then you store it. Without the assignment, R just reads the data and prints it; with the assignment, everything that is read is stored in a variable called data. Next time you type data, it shows you everything in it, and then it says it has reached getOption("max.print"). That means the data is very big, so the console does not show it all. So when you want to check whether you have read the data successfully, you can use a special function called head. head gives you the head of the data, that is, the first six rows.

Password? No, there is no password. setwd is set working directory; it is not password. It sets the working directory to make sure your input data is found there. If you change to another working directory, R will not find input.csv, because I saved the input data on my Desktop.

So, head(data). When your data is very big and the software does not show everything, you can use this trick: head returns just the first six rows, so you can glance at them to see whether your data was imported successfully. The first column is the ID, 1, 2 and so on; my data set has 2000 data points. The second column is the age. So you know each row represents a person. The third column is gender, which is 1 or 2, male or female. Race is 1 to 4: 1 is Chinese, 2 is Malay, 3 is Indian, 4 is others, the standard Singapore demographic representation. Temperature means body temperature. Heart rate is the heart rate. Resp rate is the respiratory rate. Then BP low and BP high are the low and high blood pressure. Smoker or not is binary, yes or no, 0 or 1. Vaccination is whether the person has been vaccinated, yes or no, 1 or 0.
Virus is whether the person has already caught this flu virus, yes or no, 1 or 0. As you can see, my data set comes from a public health perspective, but it doesn't matter; whatever type of data you have, you do the same thing. I just use this as an example. So now you know how to set the directory and how to read in the data. You also need to know how to write data, but not yet; I want to finish reading first.

Now I introduce another thing, the dollar sign. So far so good? Am I too fast? Okay. After you have successfully read the data and stored it in a variable called data, I need to extract columns. You can change the name to something else if you want; I just prefer data because it is easy to refer to, but any variable name will do. Right now the data is read in as one big data frame, one big spreadsheet. I want to extract an individual column, and I also want to extract individual elements. How? You have to extract the column first, and for that you use the dollar sign. The dollar sign extracts an individual vector from a big data frame.

For example, when you ran the head function you could see the column names: age, gender, race, temperature. If you type data followed by the dollar sign, RStudio automatically offers all these names. So if you know the data has this column, you can choose age. What happens when you press Enter? It returns a vector showing all the ages, 2000 of them.

But this is troublesome, so for simplicity we store it under a variable name rather than typing data$Age every time. We give it a variable name, age, with a small a or a big A, whatever you want. Next time you type age, it gives you the same vector again.
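The head check and the dollar-sign extraction described above might be sketched as follows. The small data frame is a made-up stand-in for the 2000-row lecture file, and the column names ID, Age and Gender are assumptions about how that file is labelled.

```r
# Made-up stand-in for the lecture data; column names are assumptions
data <- data.frame(ID     = 1:8,
                   Age    = c(34, 61, 18, 45, 72, 29, 50, 38),
                   Gender = c(1, 2, 1, 2, 1, 2, 1, 2))

head(data)        # first six rows only

data$Age          # the Age column as a vector
age <- data$Age   # store it under a shorter name
age               # same vector again
age[2]            # a single element, picked by position
```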
So essentially what I do here is give every column a new name: age, gender, race; for vaccination I use vacc; for virus, virus; for temperature, temp; for heart rate, hr; and for smoker or not, smoker. This is how you read the data in.

Now you have age. Do you remember what kind of data structure age is? It is a vector. What type of vector? It is not character; it is numeric. Suppose you want to check the age of the 10th person in this data set: 61. So the big data is a spreadsheet with rows and columns. You use the dollar sign to extract a column, then square brackets to extract an element inside the column. This is how you extract an individual value from your data. So far so good?

Now, I have shown you how to extract data. Your big spreadsheet has multiple columns, but suppose you are interested in only three of them: age, gender and race. You want to create a new spreadsheet with only these three columns. Of course you can open the file in Excel and delete the rest; that is the usual way. But doing it in R is a good example, because next time, with your own data file, you may want to export only certain columns. There are two ways to do it. One way is the cbind function; cbind means column bind, that is, you bind the columns together. You bind age, gender and race, and assign the result to a variable called out. You can give it any name: output, out, mydata, anything. Then you write it out. Previously you read CSV with read.csv; now you write with write.csv. out is the variable name.
And output.csv is the output file name; you give the output file whatever name you want. Now you run these lines. Oh, I have not run all my variables yet. Okay, now you see age, gender and race are all stored here. The type is shown as int, which means integer, and 1:2000 means each has a length of 2000. Then you run cbind and write.csv. Now I go to my Desktop. See, a new file has been generated, called output. When you open it, it has only the three columns you selected. Can you see?

You can also use data.frame. cbind and data.frame do the same thing here, so you can run this other pair of command lines instead. data.frame, you remember, joins vectors into one big spreadsheet. So it puts age, gender and race into a data frame and you assign it to out. It gives the same result; it is just your preference.

The 1, 2, 3, 4, 5? It is a serial number, the row position that R stores. It is an extra column that is generated automatically. You can delete it, or, if you want to keep it, give it a column name such as id for your reference. So far so good? That is input and output. These two lines are universal, so you can copy and paste them whenever you need them.

Built-in functions. This is a very interesting part: what R can do with all this data. I will give you some examples. head and tail: just now you saw that head returns the first six rows of your big data set. tail returns the last six. Next is table.
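The two ways of exporting selected columns described above could be sketched like this. The values are made up, the file is written to a temporary folder rather than the Desktop, and the name out_df is introduced only to show both variants side by side.

```r
age    <- c(34, 61, 18)   # made-up values
gender <- c(1, 2, 1)
race   <- c(1, 3, 2)

out    <- cbind(age, gender, race)         # way 1: column bind
out_df <- data.frame(age, gender, race)    # way 2: data frame (hypothetical name)

out_path <- file.path(tempdir(), "output.csv")
write.csv(out, out_path)                   # also writes the row-number column
# write.csv(out, out_path, row.names = FALSE) would leave that column out

read.csv(out_path)                         # check what was written
```

The row.names argument is one way to avoid the extra serial-number column mentioned in the lecture, instead of deleting it afterwards.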
What does table mean? It does a count table for you. For example, table of age. What is this? Can anyone tell me? It looks complicated, right? In age, as you remember, the people range from age 18 to age 80. So it counts: there are 37 people aged 18, 32 aged 19, and so on. It just counts for you, so it is a very useful function when you have a very long vector. It does a summary for you. In Excel you would sort by age and then count, sometimes manually, but R counts for you. You can see that every age has roughly the same count; it is quite uniformly distributed. That is because this is made-up data, and I purposely made it uniformly distributed. So table gives you a count summary.

summary: later I will teach you regression, where summary is very important, but don't bother too much with summary for now.

length, as I showed you, returns the length of the vector. What will length of age be? 2000, because age has 2000 elements; I have 2000 rows. You can see it from the variable pane, 1:2000.

sort: what does sort mean? The ages are all mixed up; you can see in the data file that the ages are in random order. sort gives you the sorted values. For example, sort of age sorts from smallest to largest. If you want the reverse, you add a second argument after the comma to reverse the direction. So you can do it whichever way.

"Object not found"? That means you have not run these lines; you see, your age is not listed here. You have to run these lines so that the variables are stored first. Not everything, just the variable lines. Once they appear here, you can perform all these operations. They have to be stored first. So far so good?
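The counting and sorting functions above can be tried on made-up data. The ages below are randomly generated to resemble the lecture's description (2000 people aged 18 to 80) and are not the actual file; the decreasing argument is the standard way to reverse sort.

```r
set.seed(42)                                  # makes the made-up data repeatable
age <- sample(18:80, 2000, replace = TRUE)    # made-up ages, not the lecture file

head(table(age))                     # counts for the first few ages
length(age)                          # 2000
head(sort(age))                      # smallest values first
head(sort(age, decreasing = TRUE))   # largest values first
```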
sample is a very interesting function. Sampling means you draw lots. For example, sample(100, 2): what does it mean? It means that from 1 to 100 I want to randomly draw two numbers. That is how you do sampling. If you run it again, do you expect the same values? No, because it is random; next time you get two different numbers. If you want 10 of them, you draw 10 numbers from 1 to 100 at random. In R, sampling is very important when you build models. It is like the random number generator in Excel.

sum and product are very easy. This is age. Can anyone tell me the sum of all these values? You do not want to add them one by one; you can just do sum of age, and it returns the sum. prod is the product. The product of age will be a massive number; I hope my computer can handle it. Infinity. There are 2000 values, each with two digits, so it is too big. Let's do a small one: prod(c(2, 3, 4)). I create a vector myself and take the product: 2 times 3 times 4 is 24.

mean and the others are statistical functions. You are given the age and you want to see what the basic age distribution looks like. You can use mean, which is the average in Excel: 48. Across the 2000 people the average age is 48. What is the median? 49. The maximum is 80 and the minimum is 18.

sd, var, cor: can you guess what they are? sd is the standard deviation. cor is correlation; we do not use it now, it comes later. var is the variance, which is actually the square of the standard deviation.
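A short sketch of the functions just covered is given below. The vector age here is a made-up five-element example, so its mean differs from the lecture's 2000-person data set; the overflow line uses 2000 copies of 48 as a stand-in for the real ages.

```r
sample(100, 2)     # two random numbers from 1 to 100; changes on every run
sample(100, 10)    # ten of them

v <- c(2, 3, 4)
sum(v)             # 9
prod(v)            # 24
prod(rep(48, 2000))   # Inf: the product is too large to represent

age <- c(18, 25, 49, 61, 80)   # made-up ages
mean(age)          # 46.6
median(age)        # 49
max(age)           # 80
min(age)           # 18
sd(age)            # standard deviation
var(age)           # variance, equal to sd(age)^2
```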
These are all basic statistics, so when you are given a vector and want to summarize it, this is what you do: sd, var, and so on.

floor, round and ceiling: what are they? round rounds to the nearest whole number. You would expect 4.5 to become 5 and 4.4 to become 4, but see, in R round(4.5) still goes down to 4, because R rounds an exact half to the even number. floor means the nearest whole number below, and ceiling the nearest whole number above: ceiling is the top, floor is the bottom. So floor of 4.5 is 4, and you can also do ceiling. These are the basic operations. You can see more in the R built-in functions PDF that I have already uploaded; it has a longer list of the ready-made built-in functions in R.

Now, making functions. This part is interesting, and I want to give you some exercises on how to make your own function in R. Whatever you called just now, for example mean with brackets or sd with brackets, is a ready-made built-in function. That means the function body is already stored in the R language, so you do not know how it works internally, but when you give it an argument, an input, it runs the function and returns something. That is built in. But what if you want to design your own function? For example, you want to calculate BMI. You all know BMI, body mass index: your weight divided by your height squared. You want to calculate the BMI of a certain group of people. How? This is how you design your own function. This is the format: first the function name, and you can give any name, for example bmi. Then it is assigned the keyword function, followed in round brackets by input 1, input 2, input 3. These are the arguments, that is, the input data you put into your function.
If you want to calculate BMI, what are the two inputs? Weight and height. So you have two inputs, or you can call them argument 1 and argument 2. Then, inside curly brackets, come statement 1, statement 2, statement 3: these are all the arithmetic you perform. Then you return an object. The returned object is what the function gives you back: you give arguments to the function, the function runs internally, and it outputs something for you. That returned thing is the object. Let me do a demonstration. No, no break yet; the break is after this.

I have already written the basic function. Have a look here. The function name: I call it bmi, but any name will do. Then function, the standard keyword; it is highlighted in pink, and that is the syntax highlighting. After function come round brackets with two arguments: the first argument is called weight, the second is called height. You could also call them a and b, but to make it easier to read you give them meaningful names. Then, inside the curly brackets, you perform your arithmetic: body mass index equals weight divided by height squared, written as weight divided by, in brackets, height times height. Very easy, right? That is the inside of the function. Then what do you return? You return this BMI value. So you have designed your own function, called bmi. Now you run it, and the function is defined. Next time you call your function with a weight and a height, for example weight 60 kilograms and height 1.7 metres, it returns about 20.76.
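The function described above can be sketched as follows. The argument names weight and height and the function name bmi follow the lecture; the internal variable name body_mass_index and the last two calls are additions for illustration.

```r
# weight in kilograms, height in metres
bmi <- function(weight, height) {
  body_mass_index <- weight / (height * height)
  return(body_mass_index)
}

bmi(60, 1.7)                     # about 20.76
bmi(weight = 60, height = 1.7)   # same call with named arguments
bmi(c(60, 72), c(1.7, 1.8))      # also works element-wise on vectors
```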
This is how you get the number: the function essentially takes the 60 as your weight and the 1.7 as your height, performs the calculation with weight and height, assigns the result to the body mass index variable and returns that variable. The returned value is this number. So this is how you design your own function. You can make it more complicated and put more things inside.

This returned object is a numeric value. In my slide I said you can return a numeric, or you can return a logical, that is, TRUE or FALSE. So let's update this function and call it function 2. I give you a weight and a height, and I want to check whether you are fit or not. According to the MOH guideline, if you have a BMI of 20 to 25 you are considered fit. You give me the two values and I want to know whether you are fit. I don't want to read off the value and decide myself; I want the computer to decide for me. How? Same thing: you have your weight and your height, and you calculate the body mass index as before. Then I set fitness to FALSE initially. If you are fit, I will replace this FALSE with TRUE; if you are not fit, it returns FALSE. This is called initial value assignment. Most of the time we assign FALSE first, and update it if you are fit.

Now I use a condition check, an if statement; every computer language has an if statement. if, round bracket, body mass index greater than or equal to 20, then the double ampersand, which means a logical AND, that is, you have to satisfy both conditions, then body mass index smaller than or equal to 25. That is my condition, and both parts must hold. If it is satisfied, I assign TRUE to fitness.
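A minimal sketch of this second function is shown below. The spelling bmi_fit is an assumed name, since the transcript only gives it verbally, and the 20 to 25 range is the threshold quoted in the lecture rather than something verified here.

```r
# bmi_fit is an assumed spelling of the function name used in the lecture
bmi_fit <- function(weight, height) {
  body_mass_index <- weight / (height * height)
  fitness <- FALSE                        # initial value
  if (body_mass_index >= 20 && body_mass_index <= 25) {
    fitness <- TRUE                       # runs only when both conditions hold
  }
  return(fitness)
}

bmi_fit(60, 1.7)   # TRUE  (BMI about 20.76)
bmi_fit(60, 1.9)   # FALSE (BMI about 16.62)
```

Starting from FALSE and only switching to TRUE inside the if block means the function always returns a defined value, whichever way the check goes.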
Even though fitness was FALSE initially, if the condition is true I assign TRUE to it, so fitness is changed. But if the condition is false, that is, your BMI is under 20 or greater than 25, this line is not run. With a condition check, the line is run only when the condition is met; if the condition is not met, the line is skipped. So in the end the function returns fitness, which is either TRUE or FALSE. Easy to understand?

Then we run this function. Now we have this function called BMI fit. Give it a weight, for example 60, and a height of 1.7: TRUE. That means with this height and this weight you are fit. With a height of 1.9: too skinny, so it returns FALSE. You can see this is how you build your own function. It can return values, and it can also return a logical TRUE or FALSE. This is your own program, and you can do whatever else you want with it.

All right, break. Two hours, that's nice. If you have any questions you can ask me. You have a two-hour break, right? Until two; I think the next session is from two to four. Then we can release early. Is everyone clear?

Why can't it open? Have you put the file into the folder? The D is a capital and the rest is lowercase. Then run it again. Can I go to your Desktop? I think the problem is the name; go to your C: drive. This one is lowercase and yours is a capital.

Okay, let's start early. In session two I will introduce regression models. These are very important models when you deal with data, because whatever data set you are given, you normally want to find the association within the data, and a regression model is the kind of model that tells you the relationship between two variables. After this I will also teach you visualization.
already said, R is not only very powerful for doing statistics, it is also very powerful for plotting beautiful figures. So let's start with the regression model. Before I introduce regression models in detail, we have to understand what data types a regression model works with. Normally the data you are given can be of qualitative or quantitative type. If it is qualitative, it can be either binary or categorical. Binary means logical: yes or no, 0 or 1, so only two cases. Categorical means the values fall into categories; for example race: you have Chinese, Malay, Indian or others. They are categories, not values. When you match it to the computing language, it is the character data type, and the binary type is equivalent to the logical data type. Quantitative means the data can be either continuous or discrete. Continuous, for example height: it can take decimals, so you can divide the data infinitely into smaller intervals; you can have 61.0 or 62.3, something like this. Or it can be discrete, which means it can only take certain numbers. For example shoe size: you can have size 40 or size 41, but you won't have size 40.5 or size 40.3. So these are discrete numbers. After that I will introduce regression models bearing the data types in mind; later you will need to refer back to them frequently.

So, regression models. What is a regression model? They are statistical methods to investigate the association between variables. When I say association, it is not causality. This must be very clear: association is not causality. Two sets of data being associated does not mean one causes the effect in the other. I will introduce three types of regression model: simple linear regression; multiple linear regression, or multivariate regression,
as another name; and logistic regression. I will teach them one by one, with a little bit of background statistics. So what is simple linear regression? It is a model that looks at the linear relationship between a variable y and a single variable x. This y must be a continuous variable. Remember, it can only be continuous: it cannot be logical, it cannot be categorical. And this x can be any type: it can be categorical, continuous, discrete or binary. y has other names: sometimes y is called the response variable, or the dependent variable, or the outcome variable. These are all different names given to y. x can be called the predictor, or the independent variable, or the explanatory variable. So next time you encounter these names, you know which one refers to x and which one refers to y. So what is a linear regression model? Anyone who learned basic secondary-school maths should know the linear relationship between y and x: y = ax + b, correct? The y is the y here, the x is the x here, and for the a and the b I use another notation, beta 0 and beta 1. So in y = ax + b, the formula you learned in secondary school, the a will be the beta 1 here, and the b, which is the intercept, will be the beta 0 here. Then this small sign here is the error term, which we want to minimize. A model tries to match reality, but there is always uncertainty between the model and reality. This small error term captures that, and we want to minimize it so that the model captures reality as closely as possible. So this is the basic formula, essentially the same as the secondary-school y
= ax + b. We want to use a straight line to fit the data in the scatter plot. So what is the code in R? In R you use this line of code: lm, linear model, y tilde x, that is lm(y ~ x). Let's go to your code here. Remember, previously everyone already ran the data code, the input .csv, right? We are still using that data for the demonstration. Remember, temp means the temperature and hr means heart rate. So I will run this linear model: temp will be the y variable and hr, heart rate, will be the x variable. We want to see if this x, heart rate, and this y, temperature, are correlated. Think about it: intuitively, if your heart beats faster you should have a higher metabolic rate, so you have a higher temperature, right? So these should be positively correlated, and we want to test whether that is true. How to do it: lm, temperature, heart rate; run this line. The linear regression model will be stored in this fit variable. You can call it anything you want: fit, model, anything; it is just a variable to store the values. Then you use summary. As I mentioned just now, summary is a built-in function; it will give you the summary of this model. You run the second line and it gives you a lot of stuff; I will tell you what each part means. Here it gives you the residuals, which you do not need to bother about. The things you need to pay attention to are these two values under Estimate. This one is the intercept: 3.71e+01, which is 3.71 times 10 to the power of 1, so essentially 37.1. 37.1 is the intercept, so in our y = ax + b, this 37.1 will be the b, correct? And this estimate corresponds to the heart rate: it is about 9.8 times 10 to the power of minus 3, so about 0.0098. This one will be the
coefficient in your line equation, which is a, right? y = ax + b: b is the intercept, which is this number, and this one will be the a, the coefficient of this variable. You also need to pay attention to this p here. This value is very important for you to determine whether the model really fits. Anyone who learned basic statistics, remember: what is the gold standard for something being statistically significant? p smaller than 0.05, right? This one is really, really small: it means 2 times 10 to the power of minus 16, which is definitely smaller than 0.05. That means this estimate is statistically very significant. You can also just look at the stars: see here, the intercept is given three stars, which means it is very significant. All right, can everyone follow? So in my code, just now what you have is summary(fit) and the confidence interval of fit. What does this line of code do? In statistics, anything comes with uncertainty. 95% confidence interval, you have heard this term before? I am using layman's language here, which is not statistically exact: you have 95% confidence that your true value lies within this range. When you run confint.default(fit) you get 2.5% and 97.5%; those are the lower and upper ends of the 95% interval, and you are 95% confident that your true estimate lies between 0.008 and 0.011. Remember how much the estimate we calculated previously was? 0.0098, right? So whenever you report the estimate, you also report the confidence interval; this is our practice. Now, I actually gave you the examples already, but I want to show you three of them. Remember I said the y variable, the outcome variable, must be continuous, while the x
variables can be any data type. So I want to demonstrate three examples with you. Continuous y and continuous x is essentially what I did just now. Body temperature is a continuous variable, and heart rate is also continuous; it is not categorical. After we run it you can see here, this is a screenshot, essentially the same thing I just performed: you have this value and you have this value. You have all these dots and you want to fit them with a line. Do you know what the data looks like when you want to correlate them? By right I shouldn't teach this yet, this is a later part... oh okay, I see. So this is actually what the data is: 2000 points. Each individual point corresponds to one person's temperature and heart rate, and this is how it looks for the 2000 points. A linear regression model just wants to fit a line that tries to explain all these patterns. So I want to fit a line here. What characterizes a line? You have to know the slope, and you have to know the intercept with the y axis. So this output essentially tells you what the slope is and what the intercept is. You can write: temperature, which is the y variable, equals 37.1, which is the intercept, the b in y = b + ax, plus ax, where x is the heart rate and the coefficient of it is about
0.0098. So this is the equation. The R software, using just one line of code, lm(y ~ x), will fit this line, and this line will try to explain every element in that data set. What does this equation mean? It means that whenever your heart rate increases by one, your body temperature increases by this amount, and your baseline is 37.1. And the p is smaller than 0.01. This is the practice: because it is too small, we don't write the exact p. And this is the confidence interval. So your body temperature will be 37.1 plus this number times your heart rate, and we are 95% confident about the range of that number. So we can find this linear relationship. These 2000 data points are a kind of training data set, from which I generate this relationship. Next time, some random person walks in and says, "I have a heart rate of 130 per minute, can you guess my body temperature?" Then from your heart rate I can say, okay, your body temperature is roughly this, and that will be about right most of the time. So this is how you do simple linear regression: a very easy way to find the relationship between x and y.

So, an exercise. You already know how to do continuous y and continuous x. What happens if you have continuous y and binary x? I give you an example: your y is still body temperature and the x is virus status: have you been infected by a flu virus? Intuitively, if you are infected by a flu virus your body temperature will rise, correct? You want to test whether that is true. Do it as your exercise. Can anyone guess how? Actually, I wrote it down. Don't look at it, don't look at it.
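The continuous-y, continuous-x workflow described above can be sketched as follows. The course data set is not reproduced here, so this sketch simulates 2000 made-up points; the column names temp and hr follow the session, while the simulation parameters are assumptions chosen only to resemble the quoted output.

```r
# Simulated stand-in for the course data (assumption: not the real input file).
set.seed(1)
n    <- 2000
hr   <- rnorm(n, mean = 80, sd = 10)              # heart rate, beats per minute
temp <- 37.1 + 0.01 * hr + rnorm(n, sd = 0.3)     # body temperature
d    <- data.frame(temp = temp, hr = hr)

fit <- lm(temp ~ hr, data = d)   # y ~ x: temp is the response, hr the predictor
summary(fit)                     # estimates, p-values, significance stars
confint.default(fit)             # 2.5 % and 97.5 % bounds for each estimate

# Predicted temperature for someone with a heart rate of 130
predict(fit, newdata = data.frame(hr = 130))
```

The first row of the summary is the intercept (the b in y = ax + b) and the hr row is the slope (the a).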
Block it here and write your own code. This is my first one, right? The general code looks like this; now what you need to do is change heart rate to virus. What are the variables here? Virus status is stored in a variable called virus. It is binary; remember I used table to do the summary for me? Virus is binary, yes or no, and about a hundred-plus people have the virus. So can anyone tell me how? Essentially all you need to do is change the heart rate to virus. Very easy. So what does it look like? You want to see how temperature and virus correlate. You cannot see it, right? Because it does not look like the previous cluster of points; now we have two lines. Why? Because your x variable is only binary. Then how do you find the relationship? You still use a linear regression model: temperature is the y, the response, and virus is your x variable. You run this code and store it into fit2. Previously we used fit, now we use fit2: it is a different model, so we store it in a different variable. When you take the summary of fit2, which is also shown here, body temperature against virus, after you run it virus has this estimate, 0.25. This is the beta, your slope, and this is the intercept. So essentially you can write the linear relationship between virus and temperature: temperature = 38.1 + 0.25 × virus. This is your intercept, this is your slope, times the virus status. Virus status can only take 0 or 1, right? If it takes 0, this term cancels, so your average body temperature is 38.1. So across the people without the virus, the average body temperature is about 38. Oh, and the confidence interval here: you run this confidence interval line after you run the summary. You also need to run the confidence
interval, because the summary gives you the estimate as just a single number. The estimate of the slope is this number, 0.25. But any model comes with a confidence interval; it is not a fixed number, because a model tries to match a real data set, so there will definitely be some uncertainty. The uncertainty is captured by this 95% confidence interval. How do you calculate the 95% confidence interval? You use this line, confint.default, then your model, and it will return two numbers, 0.008 and 0.011... sorry, that was for the first fit; we are doing the second demonstration, fit2, so this is our confidence interval: 0.17 to 0.33. We found that the estimate is 0.25, right? So when you report: virus status is 0 or 1. If it is 0, the average temperature is 38.1; if it is 1, your temperature will be 38.1 plus 0.25. That means if you have the virus you are 0.25 degrees higher than people who do not have the virus. This makes sense: if you get the virus, you likely get a fever. And this is made-up data, not real data; I made it up myself. With real data you shouldn't expect people to have a normal temperature of 38, right? It is just for demonstration, so don't worry about whether the data makes sense or not. And the p: whenever you have this reported beta estimate, you also need to look at the p. If the corresponding p is smaller than 0.05, this is a statistically significant estimate. If not, the model does not fit and you do not use it; the model is telling you rubbish. So this is your answer, and this is how you report the data. Another one: what if you have a continuous y and a categorical x?
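A hedged sketch of the binary-x exercise, again on simulated data rather than the course file. The variable names temp, virus and fit2 follow the session; the infection rate and noise level are assumptions. It also checks the interpretation given above: with a 0/1 predictor, the intercept is the mean temperature of the virus = 0 group, and intercept plus slope is the mean of the virus = 1 group.

```r
# Simulated stand-in data (assumption), built around the quoted 38.1 and 0.25.
set.seed(2)
n     <- 2000
virus <- rbinom(n, 1, 0.07)                         # roughly a hundred-plus infected
temp  <- 38.1 + 0.25 * virus + rnorm(n, sd = 0.4)

fit2 <- lm(temp ~ virus)     # same call as before, heart rate swapped for virus
summary(fit2)
confint.default(fit2)

b <- coef(fit2)
# Intercept = average temperature without the virus
c(intercept = b[["(Intercept)"]], mean_no_virus = mean(temp[virus == 0]))
# Intercept + slope = average temperature with the virus
c(sum = b[["(Intercept)"]] + b[["virus"]], mean_virus = mean(temp[virus == 1]))
```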
Inside the data set, which variables are categorical? Can you give me an example? Scroll up and look at all of them. Race, okay. Race is categorical because it is 1, 2, 3, 4; remember, 1, 2, 3, 4 refer to Chinese, Malay, Indian and others. But when you run the model, you see here it is written factor(race). Why do I have this extra term called factor? Because factor tells the computer that race is a categorical variable. Remember, race is represented by 1, 2, 3, 4; without factor the computer will think race is the numeric values 1, 2, 3, 4. You want them to represent categories, so you use factor to tell the computer that 1, 2, 3, 4 do not mean the numbers 1, 2, 3, 4; they mean four different categories. So this is how you use factor. Then you run it: summary, the third one, look at the data. You see race 1 is not here, correct? It only shows you race 2, race 3 and race 4. Race 1 is our Chinese race; the software takes the first one as the baseline, so you compare the others with the baseline. Later I will teach in detail how to interpret this, but for now just note that Chinese is not there and you only have Malay, Indian and others. Look at the p first, before you look at the estimate. Are these p significant? No, they are not: all of them are greater than 0.05. So you do not have to do anything more; you can just conclude that race does not determine body temperature. That makes sense, right? No matter what race you are, your body temperature won't depend on your race. You cannot say that if you're Malay you have a higher temperature and if you're Chinese a lower temperature. So this tells you that race does not affect your body temperature: there is no association between the
two. Just look at the p: if p is significant, they are correlated; if p is not significant, they are not. This is how you make conclusions based on the p. So here, for race, you can see the p values are all greater than 0.05, so they are not statistically significant. Race 1, Chinese, is the baseline, and since all the others are greater than 0.05, race is not a significant predictor. What you can conclude is that there is no difference in body temperature among the different races. Even though the p is not significant, you can still draw a conclusion: the conclusion is that there is no association.

Now let's discuss simple linear regression. We have seen three examples of simple linear regression. You have one y, which can only be continuous, and one x, which can be categorical, binary or continuous; whatever data type you want, it can give you results. But in the first example, you remember, heart rate is correlated with body temperature: a higher heart rate, a higher body temperature. In the second example we used virus status, which can also affect body temperature: if you have the virus you have a higher body temperature. But which one is really causing the body temperature to rise? There are two factors; individually, virus status and heart rate both go with a higher body temperature, but which one is really causing it? To answer that you have to do multiple linear regression, not a single parameter only. Get me? So you introduce multiple linear regression. You see the equation here is a little bit different from the previous one. Previously you only had this, only one x variable. Now you can introduce more.
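The role of factor() in the categorical example can be sketched like this. The data are simulated with no race effect built in (an assumption for illustration), and fit3 is a hypothetical variable name. The point of the sketch is the coefficient names: with factor(race) the first level becomes the baseline and the other three levels each get their own row, whereas a bare race is treated as one numeric slope.

```r
# Simulated stand-in data (assumption); 1 = Chinese, 2 = Malay, 3 = Indian, 4 = Others.
set.seed(3)
n    <- 2000
race <- sample(1:4, n, replace = TRUE)
temp <- rnorm(n, mean = 38.1, sd = 0.4)     # temperature unrelated to race

fit3 <- lm(temp ~ factor(race))   # race treated as four categories
summary(fit3)                     # rows for race 2, 3, 4; race 1 is the baseline
names(coef(fit3))

fit3_numeric <- lm(temp ~ race)   # without factor(): one numeric slope
names(coef(fit3_numeric))
```

Each factor(race) row compares that group with the baseline group, so its p-value answers "is this group different from race 1?".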
Previously, individually, you know that virus, yes, can cause body temperature to rise; race, no; and the first one, heart rate, can also. So you want to put them into a single linear regression, and then it is called multiple linear regression. You can also see whether they have any interaction. Very easy: previously it was one to one, now it is many to one. Essentially your y variable can only be continuous; I want to emphasize again, y must be continuous. x can be any type, and you can have more than one x, so you have more x variables to fit the line. This is called multiple linear regression. So what do you do in R code? Previously you only had y tilde x; now you have y ~ x1 + x2 + x3, whatever you want to put in. As many variables as you want, you just use a plus sign to add them together. So let's go to our code: multivariate linear regression. Multivariate and multiple are the same here. We want to see whether virus and heart rate both affect temperature. Individually they each affect temperature; now we put heart rate and virus together and assign the result to your multivariate regression variable; give it any name. Run it. What can you see here? This is too big; look at the slide, the same thing is already prepared here. Your y is body temperature and you have two x: x1 heart rate, x2 virus status. So you have this output: heart rate has this estimate, virus has this estimate. Are both significant? Yes, right? These two numbers are definitely very small: this e to the power of minus 11 means 0.000... with many zeros, so it is very small. With this, how do you interpret the data? These are the estimates, and this is the intercept. Just by
substituting the values back into your equation, it will be: temperature = 37.13, the intercept estimate (it is 3.713 times 10 to the power of 1, so that means this number times 10, 37.13), plus heart rate times this value, which comes from here, plus virus status times 0.25. So this is the new equation. It tells you that when you are given heart rate and virus status, I can use these two variables to predict your body temperature. Now, with the virus, virus status is 1, right? Am I too fast? Can follow? With the virus, virus status will be 1, so this term is 0.25, and 0.25 plus your 37.13 gives 37.38, plus the heart rate term. Without the flu virus, the value there is 0, so this term cancels and you are left with only 37.13 plus the heart rate term. So you can see now you have extra information. When I want to predict your body temperature, I can use heart rate, but I ask you an additional question: are you infected with the virus? If you are, I can modify my prediction a little bit, and I use this equation to calculate. So this is more detailed than just using your heart rate: I ask for an extra piece of information, do you have the virus, yes or no, and then I modify my equation a little bit. And "estimate" here means a coefficient estimate: this beta is called the estimate. This beta 0 is the intercept, x1 is our heart rate, x2 is our virus status, and beta 1 and beta 2 are the estimates, these two numbers. So you see, multivariate regression analysis actually gives you a further breakdown of your data.
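A sketch of the two-predictor model and of substituting the estimates back into the equation, under stated assumptions: the data are simulated, multi_fit is a hypothetical name, and the heart-rate slope used in the simulation (0.01) is a placeholder because the session does not quote that value for this model.

```r
# Simulated stand-in data (assumption), built around the quoted 37.13 and 0.25.
set.seed(4)
n     <- 2000
hr    <- rnorm(n, mean = 80, sd = 10)
virus <- rbinom(n, 1, 0.07)
temp  <- 37.13 + 0.01 * hr + 0.25 * virus + rnorm(n, sd = 0.3)

multi_fit <- lm(temp ~ hr + virus)   # add predictors with a plus sign
summary(multi_fit)

b <- coef(multi_fit)
# temperature = beta0 + beta1 * hr + beta2 * virus, for hr = 130 with the virus
by_hand    <- b[["(Intercept)"]] + b[["hr"]] * 130 + b[["virus"]] * 1
by_predict <- predict(multi_fit, newdata = data.frame(hr = 130, virus = 1))
c(by_hand = by_hand, by_predict = unname(by_predict))
```

Setting virus to 0 in the same formula drops the virus term, which is the "this term cancels" case described above.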
So you see, it describes your data more accurately in that sense, because you have two independent variables affecting your body temperature and now I take account of both. So that is multivariate regression.

In both simple and multiple linear regression, y must be continuous. You remember, every time I do this I emphasize that y must be continuous. But in some cases y may not be continuous. What happens, for example, if y is binary? Previously the outcome variable was body temperature, a continuous variable. But what happens if your outcome is, for example, ring or not ring, yes or no, and you want to predict this? For example, let me see: does taking some drug cause death or not? In a clinical setting the doctor gives you a drug, and then... death may not sound good, so say side effect: side effect, yes or no. Whether you take the drug or not is your x, a binary variable, yes or no. Then your outcome variable is also binary: do you have the side effect? This outcome y is no longer like temperature, which is continuous; it is now binary, yes or no. You want to see whether taking the drug causes the side effect or not. So this is an x and y correlation where y is binary. How do we deal with this kind of situation, where y is no longer continuous but binary? We use logistic regression. It is a little bit complicated in terms of mathematics. I just want to show the math, but how it is done internally you do not have to bother about; you just need to know how to use it. So when your y becomes binary, yes or no, 0 or 1, you change the y into log of p divided by 1 minus p, where p is the probability of getting the outcome. For example, in the case
I gave you previously, does taking the drug cause a side effect? Just think about two columns. One column is x, whether you took the drug or not: 0 or 1, 0 or 1, 0 or 1. One column is y, whether you have the side effect, which is the outcome variable: 0 or 1, 0 or 1. Say you have a hundred of them. How many are 1 and how many are 0? You can count, right? For example, say 30 of them have the side effect. Then what is the probability of people getting the side effect? 0.3, correct? You have 100 of them and 30 of them get it, so the probability is 30 divided by 100, 0.3. So 0.3 will be the probability: you substitute 0.3 for p, and 1 minus p will be 0.7. Then you divide this by this, and this is called the odds, the odds of getting the side effect. Then you take a log. Why do you need a log here? Why the log transformation? I said the x variable can take any type, right? When you want to fit a model, the data range of your x side and your y must be the same. The x side can take any number from negative infinity to positive infinity, but now your y can only take 0 or 1. So you want to transform your y into a range from negative infinity to positive infinity. This is how we do the math trick. A probability can only be between 0 and 1, correct? Then 1 minus the probability is also a number between 0 and 1. You take the range 0 to 1 divided by 0 to 1 and you get 0 to infinity, right? So without the log, the value range of this part is 0 to positive infinity. But you need y values from negative infinity to positive infinity. How? You do a log transformation, and the log transformation will change 0
to positive infinity into negative infinity to positive infinity. This is the math background; you do not need to bother about it, I just want to introduce the odds ratio. So instead of writing y here, for logistic regression, where your outcome variable y is binary, you use this thing, log of p divided by 1 minus p. The right-hand side is the same. So I introduce a term called the odds ratio. The odds ratio compares the odds that the outcome will occur with an exposure to the odds that the outcome will occur without the exposure. Just now I explained the 100-patient example: out of 100 patients, 30 have the side effect, so p is 0.3 and 1 minus p is 0.7, and 0.3 divided by 0.7 gives the odds. The odds ratio then compares the odds of the outcome, getting the side effect, with the exposure, taking the drug, to the odds of getting the side effect without the exposure, without taking the drug. This is how you interpret it. And when you run the model, because your y is now the log odds, when you want the estimates, which are essentially what we calculated previously, you need to exponentiate. You took a log, so to get back you have to take the exponential, correct? So whatever beta you have here, you exponentiate it to get your odds ratio. This is a little bit messy, but I will teach you how to interpret it. So how do you do it in R? Same thing. Remember previously we used lm; see the difference, now you have a g in front. lm means linear model; glm means generalized linear model, because the function generalizes it so you have more power. So this is the difference: use glm instead of lm. Then you put in your y variable, which is binary, and x1, x2, x3, all your different input variables.
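The transformation described above can be written out as follows. This is a summary of the spoken derivation in standard notation, using the same 30-out-of-100 example; the symbols \beta_0, \beta_j, x_j follow the earlier linear regression notation.

```latex
% Odds and the log-odds (logit) model
\[
\text{odds} = \frac{p}{1-p}, \qquad
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]
% Ranges: p in (0,1), odds in (0, +infinity), log-odds in (-infinity, +infinity)
\[
p \in (0,1) \;\Rightarrow\; \frac{p}{1-p} \in (0,\infty) \;\Rightarrow\;
\log\!\left(\frac{p}{1-p}\right) \in (-\infty,\infty)
\]
% Worked example and the odds ratio for a 0/1 predictor x_j
\[
p = 0.3 \;\Rightarrow\; \text{odds} = \frac{0.3}{0.7} \approx 0.43, \qquad
\text{odds ratio} = \frac{\text{odds}\,(x_j = 1)}{\text{odds}\,(x_j = 0)} = e^{\beta_j}
\]
```

The last identity is why the fitted estimate has to be exponentiated before it can be read as an odds ratio.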
Now you need to add one extra thing, family = binomial. Binomial distribution, you have heard of that before? It means yes or no, right? To make sure glm uses a binomial model, you have to include this argument. So how do you do logistic regression? For example, I want to look at virus status and vaccination; vac means vaccination. I want to see: if you have been vaccinated before, do you still get infected by the virus? Intuitively, if you were vaccinated before, you shouldn't get the virus, right? So how do we test it? We use glm. Virus status, which is binary, do you have it or not, is the y variable; vac, vaccination, is the x variable; family is binomial, because you tell R that this is a binary outcome, a logistic regression, not a normal linear regression. Then you run it. You can see here: is the p significant? Yes. First things first, you need to look at the p. If p is not significant you do not need to do anything further; you do not use this model. If it is significant, you analyze it further and look at the estimate here. Remember, this estimate is on the log scale: it is a log odds ratio, so you need to exponentiate it. Unlike the previous simple and multiple regression, where you just take this estimate as your beta, your slope, the a in y = ax + b, here instead of just plugging this number in you need to take the exponential. That means e to the power of minus 2.53. But you do not have to do it manually; you can just run this line. coef means coefficient, and coefficient means this, the estimate. In the R language it is coef. This is
your model; you use coef to extract this number from it, and then you exponentiate. So this 0.08 will be your odds ratio, and later I will teach you how to interpret the odds ratio. You have to get this number out first: instead of using the estimate coefficient directly, in logistic regression you have to exponentiate first. How to exponentiate? You can do it manually by typing the number, or you can call it: coef, this function, extracts the number from the model, and then you take the exponential, so you do not have to type it yourself. So this 0.08, what does it mean? This minus 2.53, do not use it directly; in logistic regression you need to exponentiate first. So the odds ratio is 0.08. How to interpret it? The interpretation is that people who have received the vaccine are 92% less likely to have the virus (strictly, their odds are 92% lower), because odds are a comparison: 0.08 means the odds of having the virus for people who were vaccinated are 0.08 times the odds for people who were not, and 92% is 1 minus this number. Is it confusing? A little bit? Okay. You all know how to get this number, right? When you run the model it returns your estimate, and this estimate you cannot use directly; you exponentiate it, because logistic regression works on the log scale. When you take a log and want to go back to the original number, you have to exponentiate: log a number, exponentiate it, and it goes back. So exponentiating gives you 0.08, and this 0.08 is the odds ratio. This odds ratio compares, remember how I defined it, people who
receive and who do not receive, right? The odds is p divided by 1 minus p: p is the share of people who have the virus, 1 minus p is the share who do not. The odds ratio then compares these odds between two groups. In other words, the odds for people who get the vaccination are 0.08 times the odds for those who don't, and another way to say it is that people who get the vaccination are 92 percent less likely, in terms of odds, to get the virus. This 92 is 1 minus 0.08. So this makes sense: people who get the vaccination are about 90 percent less likely to get the virus than those who didn't get the vaccination. The data tells you this, and it also tells you the exact number. This is very powerful in terms of interpreting data. You want to know if the vaccination works or not? Yes, it really works, and it cuts the odds of getting the virus by 92 percent. You can apply this to other scenarios. This is how you interpret an odds ratio with a binary y and a binary x.
What happens if you have other types of data, binary y with continuous x, and binary y with categorical x? Can you do the exercise yourself? Look at the initial data. Apart from the virus status, what else is binary in your data? How do you see your data? head(), right? You want to check your data. Virus is 0, 0, 0. What else is 0 or 1? Smoker or not, right, whether the participant is a smoker. You see vaccination, smoker and virus status are all 0 or 1. So we can use the example of smoker. We want to see if age is a factor for being a smoker. Age is continuous, right? So we want to see if age affects the smoking status, that is, whether age determines if you smoke or not. Intuitively, yes or no? No, right? Nobody says that if you are getting older you are more likely to be a smoker. So intuitively age and smoker status are not correlated at all. So we want to test whether that is true.
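To make the exponentiation step concrete, here is a minimal sketch in R. The data are simulated, and the column names `virus` and `vaccine` are assumptions standing in for the session's data set, so the numbers will not match the 0.08 from the class. It extracts the estimate with coef(), exponentiates it, and cross-checks the result against the odds worked out by hand from a 2 x 2 table.

```r
# Simulated stand-in for the session's data (column names are assumptions).
set.seed(1)
n <- 2000
vaccine <- rbinom(n, 1, 0.5)
# Made-up truth: vaccination lowers the log-odds of infection by 2.5.
virus <- rbinom(n, 1, plogis(-0.5 - 2.5 * vaccine))
dat <- data.frame(virus, vaccine)

fit <- glm(virus ~ vaccine, data = dat, family = binomial)

beta <- coef(fit)["vaccine"]          # estimate on the log-odds scale
odds_ratio <- exp(beta)               # back-transform to an odds ratio
pct_lower <- (1 - odds_ratio) * 100   # "percent lower odds"

# Cross-check: odds = p / (1 - p) within each group, then take the ratio.
tab <- table(dat$vaccine, dat$virus)
odds <- tab[, "1"] / tab[, "0"]
manual_or <- unname(odds["1"] / odds["0"])

print(round(c(estimate = unname(beta),
              odds_ratio = unname(odds_ratio),
              manual = manual_or), 3))
```

With a single binary predictor the model's odds ratio and the hand-tabulated one agree, which is a handy sanity check on the interpretation.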
How to do it? This is the line, very easy, same thing: glm, the generalized linear model. Smoker is the y, the response: you want to test if the participant is a smoker or not, so it is binary, zero or one. Age is continuous. Then family is binomial. You run this line of code. Binomial tells the computer it's a logistic regression. Family means this: a generalized linear model can have multiple families. It can be the binomial family, the Gaussian family, the Poisson family. Later I will briefly talk about these, but for now you need to know that logistic regression is binomial, because the outcome is yes or no, one or zero.
So you run this, then you store it into a variable; any variable name you give it is fine. Then you run summary(). When you get the result, first thing first, what do you need to look at? p. Look at p first: is it significant? No, it's not significant. So what does it say? No association between age and smoker status. In other words, age does not determine if you're a smoker or not. Makes sense, right? So statistically it says so as well. Here I do the same thing: p is greater than 0.05, not significant. Our conclusion is that since p is greater than 0.05, age is not a significant predictor of smoker status.
All right, next: binary y and categorical x. For the binary y we still use smoker as the example. For a categorical x, which one is categorical? Race, right? In my data set race is categorical. Then we use factor(). Still remember? We cannot just put race here. If we just put race here, the computer will take 1, 2, 3, 4 as numbers, as individual numeric values. You use factor() to tell the computer that race 1, 2, 3, 4 are categories; the codes 1, 2, 3, 4 are not numeric.
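The two fits just described can be sketched as follows. Everything here is simulated: the columns `smoker`, `age` and `race` are assumed names, and the made-up data let smoking depend on the race code but not on age. The sketch pulls out the p-value for age and shows what changes when race is wrapped in factor().

```r
# Simulated data; column names and effect sizes are assumptions for illustration.
set.seed(2)
n <- 500
dat <- data.frame(
  age  = round(runif(n, 21, 70)),
  race = sample(1:4, n, replace = TRUE)
)
base_logodds <- c(-2, -0.3, -0.9, -1.2)   # one made-up value per race code
dat$smoker <- rbinom(n, 1, plogis(base_logodds[dat$race]))

# Binary y, continuous x.
fit_age <- glm(smoker ~ age, data = dat, family = binomial)
p_age <- summary(fit_age)$coefficients["age", "Pr(>|z|)"]
print(p_age)

# Binary y, categorical x: without factor() the codes are read as numbers.
fit_num <- glm(smoker ~ race, data = dat, family = binomial)
fit_cat <- glm(smoker ~ factor(race), data = dat, family = binomial)

print(names(coef(fit_num)))   # a single slope for "race"
print(names(coef(fit_cat)))   # one coefficient per non-baseline category
```

The numeric version forces one slope across the codes 1 to 4, while the factor version estimates each category separately, which is why factor() matters for coded categories.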
They are categorical. Then you run this line. First thing first, what do you look at? p, correct. You don't have to read the number; you just look at the stars. The more stars, the more significant; three stars means definitely significant. So all of them are significant predictors, so race is a significant predictor of smoker status.
But how to interpret it? Look, you do not have a race 1 row here, right? You have only 2, 3, 4. This is because in logistic regression you have to have a baseline, a base to compare with. So for race 2 you have this estimate, 1.68, and this number is a comparison with race 1. Race 1 is the baseline. Race 2 is Malay, so this number tells you the log odds ratio comparing Malay versus Chinese. Then this one, 3, is Indian: this number tells you the log odds ratio comparing Indian versus Chinese. The first one is always the baseline. This one is the others category, and it is also compared with Chinese.
But can you use this number as the odds ratio? No, you have to exponentiate. So you take the exponential of all three, and you get these numbers. These are the odds ratios comparing Malay with Chinese, Indian with Chinese, others with Chinese. How to interpret the number? This 5.3 means the odds of being a smoker if you are Malay are about five times the odds for Chinese. Remember one thing: Chinese is the baseline. So whatever value you get here is a ratio, and a ratio means you compare something with something; the "with something" is Chinese. So this result essentially says race is a significant predictor of smoker status, and in this data set Malays have about five times the odds of being a smoker that Chinese have. So that means if you are Malay, you have about five times
the odds of being a smoker compared with being Chinese and a smoker. So Malays like to smoke, in that sense, and how much more likely are they? About five times, in odds, compared with Chinese. If you go to the data set yourself, you can dig out all the smoker statuses and all the races and tabulate them: how many Malays are smoking, how many Chinese are smoking. The ratio of the odds will be about five. But tabulating is very troublesome; running the regression will tell you directly. And similarly, if you are Indian, you have about three times the odds of being a smoker compared with Chinese. In my data set; I'm not saying it generalizes. In this data set, Chinese are less likely to be smokers.
The important thing about logistic regression is that whenever you get your estimate, which is a beta, you have to exponentiate it. The exponential is the odds ratio, and when you say it's a ratio, it must be something compared with something, right? Something against something. The bottom part, the "with something", is the baseline. The baseline is always the first thing in your category, which is Chinese, because Chinese is coded 1. If you change it, if you want the baseline to be Malay, then it will be the other way around. But you have to have a baseline, and by default the first category in your categorical data is the baseline. Clear? Interesting result, right?
So now I will do a summary of regression models. Today we have learned three different regression models. Simple linear regression: the y response must be continuous, meaning it is numeric and can take any value. The x variable must be a single variable, and it can be of any type: binary, categorical or continuous. Multiple linear regression is for when you have more x variables, when you want
to see the effect of every x variable you add into the equation. You have multiple of them, so you use multiple regression: multiple factors rather than one single factor. So these two are essentially the same; multiple regression is just an upgrade of simple regression.
Previously we only saw that y must be continuous. What happens if y is categorical, sorry, if y is binary? If y is binary, then you have to use the logistic regression model. Which model to choose depends totally on the data type. That's why I emphasized data type at the start: you have to recognize your data type first to determine which model to use. The x variable can be a single variable or multiple variables. We do not care what x is or what type x is; just look at y, whether y is binary or continuous. And the important thing is, when you use logistic regression and get the estimate, you have to remember to exponentiate it to get the odds ratio.
Okay, generalized linear models. Just now you asked about the family, why it was binomial. That is what you use when you deal with a binary outcome, the logistic case. In the generalized linear model, the glm function, the family can also be gaussian. Gaussian means normal, right? Gaussian distribution, normal distribution. Then you also have the Poisson distribution. You have heard of Poisson? Poisson means counts. Whenever you have count data, for example the number of occurrences, the number of disease cases this year, the number of cars per minute, the number of students in different classrooms, you use Poisson. So it depends on your variable, on whether you know it follows a certain distribution. For example, say you want to do an imaginary model where your variable is height, and you want to see if gender affects the
height, for example girl or guy. Intuitively, guys are on average taller than girls, right? So you want to test that. Your x variable will be guy or girl, so x is binary, and your y variable is continuous, the height as a number. But look at how height is distributed. Previously, when I plotted the numbers, the scatter plot generally looked like a straight line, right? Height on its own is different: height follows a normal distribution. If you plot people's heights, it looks like a bell, not like a line. It goes up and then comes down, because height cannot keep increasing forever. So height follows a normal distribution, and you know this as prior knowledge. So when you fit a model of gender and height, you fit it with the Gaussian family: you use glm, y is your height, tilde, your gender, comma, family equals gaussian. Then a Gaussian distribution is used to fit this model. This is what you do when you have prior knowledge of the distribution of your y.
Another example: say y is the number of cars per minute. Then it is count data, right? It is no longer normally distributed; it is Poisson distributed. Poisson is for count data, you all know that. So we may want to fit, for example, the number of cars, or the number of disease cases per month. Take dengue fever in Singapore: every month there are on the order of a hundred-plus dengue cases. Say you want to look across one whole year, 52 weeks, at how many people got dengue fever each week. So you go to the MOH website and download the weekly case data. So this weekly
case data, what kind of distribution is it likely to follow? A Poisson distribution, because it is count data, right? So you use Poisson: you use glm, and y will be your number of cases per week, then tilde. The x variable is probably the weather. If you have bad weather, rain for example, you probably have more stagnant water, so you have more mosquitoes, so you have more cases. So you want to test whether the weather leads to a higher number of dengue cases. You may write glm with y as your number of cases per week, tilde, the weekly rain status, rain or not. Or, if you do not want to use rain-or-not as a binary variable, you can use the exact amount of rainfall, how much precipitation in millimeters. Then you can see whether higher rainfall goes with a higher number of people getting dengue fever. What family do you use? Since your y is count data, and from prior knowledge it should follow a Poisson distribution, you write comma, family equals poisson.
So these are the different types. I give an example here, Gaussian regression: that is for when you already know your y follows a normal distribution. What things follow a normal distribution? Most things in nature: height, weight, and also, for example, test scores. You know this as prior knowledge, so in your glm you can choose the family. Just now we used family equals binomial because y was binary. If y follows a normal distribution, you use family equals gaussian. If you know your y follows a Poisson distribution, you use family equals poisson. You can get all these minor details online, and I will also upload a document on the different types and the extra arguments for choosing your model. If you pick the wrong family, then the model will not capture the data.
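As a rough sketch of the family choices above, the following R code fits a Poisson model to weekly counts and a Gaussian model to heights. All data are simulated; `rain_mm`, `cases`, `male` and `height` are hypothetical names, not the MOH data or any data set from the session. It also checks one fact worth knowing: with the Gaussian family, glm reproduces the coefficients of lm.

```r
# Simulated weekly counts; names and effect sizes are assumptions.
set.seed(3)
weeks <- 52
rain_mm <- round(runif(weeks, 0, 120))                    # weekly rainfall, mm
cases <- rpois(weeks, lambda = exp(2 + 0.01 * rain_mm))   # weekly case counts

# Count response: Poisson family.
fit_pois <- glm(cases ~ rain_mm, family = poisson)
# The Poisson family uses a log link, so exp(coef) is a multiplicative
# change in the expected count per extra millimeter of rain.
print(exp(coef(fit_pois)))

# Continuous response: Gaussian family, binary x.
male <- rbinom(200, 1, 0.5)
height <- rnorm(200, mean = 160 + 12 * male, sd = 6)
fit_gauss <- glm(height ~ male, family = gaussian)
fit_lm <- lm(height ~ male)

# Same coefficients either way.
print(all.equal(coef(fit_gauss), coef(fit_lm)))
```

The family argument changes how y is modeled; the formula on the left of the comma is written the same way in every case.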
When you do a model, you want to use the model to fit your data, and the data will definitely follow some distribution. If you have enough data, either it follows a linear relationship, meaning that when x increases, y also increases, so when you scatter plot them they fall along a line, and that line is the linear regression model. Or sometimes, height for example, it does not increase forever: it goes up and then comes down, so when you plot it, it looks like a bell. So you know height follows a normal distribution, and then you choose the Gaussian family in your glm. So you choose a model to fit the data. If you choose the wrong one, you will still get some result, but the result is not valid.
How do I know if it's valid? For example, if your y response is binary, you definitely should not use a normal distribution to fit it, because your y is not continuous. If you use the wrong one, sometimes the system will prompt you with an error, but sometimes the system does not give you any hint. For example, you could fit a normal distribution to Poisson count data. Do you remember, in JC statistics you learned that a Poisson distribution can sometimes be approximated by a normal distribution when the mean is large, greater than 50, say? Remember that the Poisson distribution's argument is the mean, Po(mean), and the normal distribution is N(mean, standard deviation). You still remember your JC stuff, right? So sometimes these two distributions can be approximately equal when certain conditions are met. If your sample size is big enough, the normal distribution is okay.
lm is the linear model; the linear model only deals with a straight line. glm is the generalized linear model, so you can have binomial,
Gaussian, Poisson. glm is more general than lm; you can always use glm. But if you do not write comma, family equals something, the default is Gaussian. Gaussian, yes, and with the Gaussian family glm gives the same fit as lm. So if you just want to do a linear regression model, without all this binomial or Poisson, you use lm. For glm, if you do not write family equals something, the default is Gaussian; you write family equals poisson for counts; you write family equals binomial and then it's logistic. No, no, you don't have to.
Okay, we have spent one hour talking about regression models already. Is everyone clear about what a regression model can do? A regression model, in essence, tells you about the association between two sets of data. You want to see if one column of data is correlated with another column of data, and you want to find the coefficient, to find how they are correlated. For linear regression you can express it as y equals ax plus b: given any value of x, you can predict y from the equation you obtained. That is linear. Then I also told you about logistic. It's different because your y variable is binary. You cannot use linear regression because y only has two outcomes, zero or one. So you have to do a transformation, using the log of p divided by 1 minus p, which is the log odds. The odds tells you about p divided by 1 minus p, and the odds ratio compares the odds in one group with the odds in another, so you have a baseline. That's why just now I taught you how to interpret: when you interpret the result, you ask what the odds of the outcome are compared with the baseline. That's why, when I analyzed the data, I said the Malays have higher odds of being a smoker than the Chinese. How much higher? Five times. The baseline can be changed. It's troublesome and I don't want to confuse you, but by default the first category is the baseline.
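For anyone curious how the baseline is changed, here is a small sketch using relevel() from base R. The data are simulated, and the coding 1 = Chinese, 2 = Malay, 3 = Indian, 4 = Others is an assumption made for the illustration. It shows that the first factor level is the default baseline and that swapping the baseline turns an odds ratio into its reciprocal.

```r
# Simulated data; the race coding and effect sizes are assumptions.
set.seed(4)
n <- 1000
race <- factor(sample(1:4, n, replace = TRUE),
               levels = 1:4,
               labels = c("Chinese", "Malay", "Indian", "Others"))
log_odds <- c(Chinese = -2, Malay = -0.4, Indian = -1, Others = -1.3)
smoker <- rbinom(n, 1, plogis(log_odds[as.character(race)]))

# Default: the first level (Chinese) is the baseline.
fit_default <- glm(smoker ~ race, family = binomial)
or_malay_vs_chinese <- exp(coef(fit_default)["raceMalay"])

# Make Malay the baseline instead and refit.
race_malay_base <- relevel(race, ref = "Malay")
fit_malay <- glm(smoker ~ race_malay_base, family = binomial)
or_chinese_vs_malay <- exp(coef(fit_malay)["race_malay_baseChinese"])

print(levels(race)[1])
print(unname(c(or_malay_vs_chinese,
               or_chinese_vs_malay,
               1 / or_malay_vs_chinese)))
```

The last two printed numbers agree, so an odds ratio of about 5 in one direction corresponds to about 0.2 in the other.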
Category one is your baseline, or whichever category comes first in your data. Yes, you can also change the baseline to be Malay, for example. Then Chinese becomes one of the comparison categories, and the odds ratio for Chinese will be about 0.2, right? Because the Malays' odds are five times higher, so the Chinese odds will be 0.2 times the Malays'. So the two results correspond.
All right, linear regression is a very simple idea, but it's a very powerful method in statistics, and it is very easy to perform: two lines of code.
Now it comes to visualization. Just now I said R is not only very powerful for statistics; it is also very, very powerful for making beautiful figures. By beautiful I really mean beautiful. So now I will teach you some default functions embedded in R, the built-in functions for plotting, the normal way to plot, just to give you a sense of what R can do. The other way is plotting with advanced packages. I didn't put code for the advanced packages in my material; I just tell you that there are packages available for plotting much nicer figures, and these take longer to learn because everything is code. For example, in my work I use a package called grid. If I want to draw a square, I don't tell the computer to draw a square; I tell the computer to draw four lines, and I specify the four lines: this line starts from (0, 0) and ends at (0, 1), this line starts at (0, 0) and ends at (0, -1), and so on. I draw four lines to make the figure. So this is more customizable for the user, but it's not user friendly.
So these are some figures that I plotted for my research papers, so you can have a rough look at the types of figures R can plot. You can see it can plot this kind of bars and lines, with the axes here. It can also make this box plot to show the maximum and minimum; here I don't show the outliers. It can also draw these curves, lines, a dispersion of
points, a scatter plot of points. It can also draw this ROC curve. ROC curve, have you heard of it before? Area under the curve... okay, never mind. The ROC curve is a way to check how well your regression model performs; it's an additional test after you do the regression. Then this one is a stacked bar plot: you have three different categories and you stack them together. Looks nice, right? So this is what R can draw. In my opinion it's much nicer than what you draw in Excel, because you can specify all the colors. These are just a preview, and they mostly use the more advanced R packages. I will teach you the basic ones, just for visualization; if you want to learn the more advanced ones, I can give you the link.
Okay, so default functions. These are the default plotting functions in R, for example plot. Let's do it in real time. When you write plot, what can plot do? What are the things you want to plot? Here I give you an example. I have this vector called cars; you run it. Then you have this trucks; you also run it. When you type cars, it returns all these numbers. Do you remember this is a vector? Last session you saw c, for concatenate, used to introduce a vector. So, cars and trucks. To plot cars, you just run plot, and I will tell you what the individual arguments are.
So this is a figure made with plot; it automatically appears in the panel on the right-hand side. The horizontal axis is the index, so you have five indexes because you have five numbers. The vertical axis is the value, how many cars. So, plot(cars): cars is the variable you want to plot, comma, type is a small "o". Small "o" means "overplotted": both the points and the lines joining them are drawn. The points you see are open circles, not solid dots. You can change the point type; later I will show you the different types of points. You can also draw a
triangle; you can also draw a square; you could even draw a Mickey Mouse if you install something for that. Then col: col means color. It can be blue; you specify a name. You can also change it to red, and then it becomes red. And color does not mean you can only choose red, blue, black and so on. In R you have very big power to use whatever color you want, for example a steel blue. There are many colors; I have a list of a few hundred of them, and you can just choose the one you want.
ylim: ylim means y limit. Here the y limit is zero to twelve. What does it mean? Yes, you define the y axis to run from zero to twelve; you limit it. You can also limit your x axis, but here the x axis doesn't matter, since it's just an index. So this is how you plot using the basic plot function.
Next, you also have your trucks, right? You have already plotted cars; plot gives you a panel like this, with the x axis and y axis. Now you want to add the trucks on top of the cars. How? Now you do not use plot again; you use lines. If you call plot again, it will replace this figure with a new plot. But if you want to add lines on top of this plot, you use lines. In lines, what do the arguments say? type is still small "o", so again points joined by lines. pch, what does it mean? pch means plotting character, the point symbol. pch equals 22 means a square; I have a reference chart for you to look up the numbers. This is how the software developers designed it; it's stored inside the built-in function. lty means line type; it's a short form. Here lty equals 2: 1 is a solid line, 2 is a dashed line. You can just change it manually.
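Put together, the plot and lines calls described above look roughly like this. The values in `cars` and `trucks` are illustrative only, since the numbers typed in the session are not in the transcript; the arguments are the ones discussed.

```r
# Illustrative values; the session's actual numbers are not in the transcript.
cars   <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)

# First series: points joined by lines (type = "o"), blue,
# with the y axis fixed to run from 0 to 12.
plot(cars, type = "o", col = "blue", ylim = c(0, 12))

# Second series drawn on top of the existing plot:
# square points (pch = 22), dashed line (lty = 2), red.
lines(trucks, type = "o", pch = 22, lty = 2, col = "red")
```

Because lines() adds to whatever plot() set up, the ylim chosen in the first call has to be wide enough for both series.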
The color you have already seen: red, or you can change it to purple as well, and now it becomes purple. Whatever color you want. So go to my slide. This is pch. Just now it was 22, right? 22 is a square. You can choose any number you want and look up the corresponding symbol, and the point type will change accordingly. Same for the line type, lty: 2 is dashed, 3 is dotted, and you can change it as well. That is how it is defined; you just change the number correspondingly.
Regarding color, just now I said you can use words like red, blue, purple, yellow, right? But if you want to be more fancy, you can also define a color by its RGB values or its hex code. You know RGB, right? Red, green, blue. Any color is a composition of these three elementary colors. So you can define rgb with 1, 1, 1, which I think is white, yes. Then you can also use a hex code, a hexadecimal number; colors can be represented by a hash followed by a hexadecimal number. For example, go to my Google Drive, the R color chart. These are all the different R colors; you can see the different grays. You can choose whichever one you want. Each comes with a name, so you can use, for example, coral1, or another name from the chart. So we have a variety of colors for you to choose from. So this one is specified by name, this one is specified by hex code, this one is specified by RGB. You can use different ways to specify the color.
ylim, yes. You specify where y starts and ends. Now I put 0 to 12, right? You can change it to -2 to 12 and see what happens. You try. You can just play with all the different code. For example, here I change it to -10 and you see the effect: your y axis now runs from -10 to 12. So this is how you change your y
limit. Normally we change it like this. Okay, color black. Then type: you can change the type, or take it out, and see the difference. So these are different ways to play with it. You can also change the 22: pch equals 22 is a square, right? If you want to make it a triangle, that is 2, so you just change it to 2 and it's a triangle. This is very customizable.
Then you want to add a title to your figure. main is where you define your title. col.main is the color you want your title to be, here red. font.main is the font of the title, here 4, which is bold italic.
Now you want to output this figure, right? The figure appears in the right-hand panel. How to output it? You have two ways. One way: go to your figure, Export, Save as Image or Save as PDF. But with this way the resolution is not good; it may be very low. So you want to save this figure and also set the resolution. How? You can use these lines of commands that I pre-defined: png, with an output file name for the PNG; you give it a name. Height and width you can define yourself; here you want your figure to be 10 by 10 cm. You set the resolution, here 1000, and the point size is 10. All of these you can define yourself. Then dev.off: that means you open the file, then you close the file. So you put your figure commands in between these two lines, then you run all of them, and it will generate the figure and output it into the directory you are working in. Okay, it's done. So go to your desktop, here; see, it is output here, and the resolution is much better. You can see when I zoom in, correct? So this is how you output your figure.
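The color options, the title and the PNG export can be sketched in one short script. The file name, the title text and the data values are hypothetical; the 10 by 10 cm size, resolution of 1000 and point size of 10 follow what was described. The file is written to a temporary folder here so the sketch does not clutter the working directory.

```r
cars   <- c(1, 3, 6, 4, 9)      # illustrative values only
trucks <- c(2, 5, 4, 5, 12)

# Three ways to write the same color: by name, by hex code, by RGB.
by_name <- "white"
by_hex  <- "#FFFFFF"
by_rgb  <- rgb(1, 1, 1)          # red, green, blue, each on a 0 to 1 scale
print(c(by_name, by_hex, by_rgb))

# Hypothetical output file; use a plain file name to save into the working directory.
out_file <- file.path(tempdir(), "figure.png")

if (capabilities("png")) {
  # Open the file device: 10 x 10 cm, resolution 1000, point size 10.
  png(out_file, width = 10, height = 10, units = "cm", res = 1000, pointsize = 10)
  plot(cars, type = "o", col = "blue", ylim = c(0, 12))
  lines(trucks, type = "o", pch = 22, lty = 2, col = "#D95F02")
  # Title text is made up; col.main and font.main style the title.
  title(main = "Autos", col.main = "red", font.main = 4)
  dev.off()                      # close the device so the file is written
}
```

Everything between png() and dev.off() is drawn into the file instead of the plot panel, which is why nothing appears on screen while it runs.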
And the resolution you can define yourself.
For the colors, I showed you the color chart. Another website is called ColorBrewer 2. ColorBrewer is a very, very useful website. It gives you combinations of colors to use when you make your figure. For example, you want to make a bar chart or a pie chart, and you want the colors to look nice. Normally if you just write red, green, blue, the colors look very ugly. So you want to choose a set of colors that are compatible and look nice together. You can choose a scheme here. For example, you can pick a sequential scheme, which runs from the lightest to the darkest. You can also choose a diverging scheme. You set the number of classes, say you want five or six different colors, and you want them to be diverging, and then you choose the color scheme you want. You see, these five colors look very nice when they are put together, and they are compatible. You can also filter for schemes that are colorblind safe, print friendly and photocopy safe. So this four-color scheme is colorblind safe and print friendly, which means that if you print in black and gray, these four colors can still be told apart. For these four colors, the hex values are here, and you can also get the RGB values. All these values you can just copy when you plot your figures. For example, with the hex color, I want this blue; when you plot, see, the light blue is here.
Okay, so advanced packages. Advanced packages means you need to install some libraries. I said in the first session that R is very extensible: a lot of packages are available online, and you can just download them for your specific needs. In R there are some good graphics packages, for example ggplot2 or grid. How do you install a library? You go here,
to the Packages tab in the plots and packages panel, and you press Install, then type whatever package you want, for example grid. When I try to install it, it says it's not available, because grid already comes with R; I already have the grid package. If you do not have a package, say ggplot2, you type it and press Install, and it will download the package for you. Next time you will see ggplot2 listed here. When you want to use it, you put a tick here, or in your code you write library(ggplot2). Whenever you want to use a specific package, you have to call library. It's just like in C, where you use include to bring in a library. In R it's the same.
So now let's do some plots. For example, use this plotrix package; try to install it. Can any of you install this one? Normally you do not have it by default; you have to install it. Go to Packages, Install, type the name, then press Install.
So now we try to plot some pie charts. For example, we use slices; these are the sizes of the different slices. It's a vector, right? How many elements? Five. And for each of them you have a label in another vector; it's a character vector with country names. Then you call pie to plot a pie chart: pie(slices, ...). slices is the thing you want to plot; it's based on the numbers. labels will be the labels, the names you just defined as a character vector. main will be the figure title. You run this and it looks like this. Looks nice, right? The default colors also look quite nice, but you can change the colors to what you want. You can also plot it in 3D; you just need pie3D. The default colors you can change. How? Do this. Still remember, you can use col to change it; you specify the colors by adding an extra argument.
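In code, the install-then-load routine and the pie chart look roughly like this. The slice sizes, country labels and title are made up for illustration. The pie3D call is wrapped in a check so the script still runs when plotrix has not been installed.

```r
# One-off download from CRAN; uncomment if plotrix is missing.
# install.packages("plotrix")

slices <- c(10, 12, 4, 16, 8)                               # illustrative sizes
lbls   <- c("US", "UK", "Australia", "Germany", "France")   # illustrative labels

# Flat pie chart with base R: values, labels, figure title.
pie(slices, labels = lbls, main = "Pie Chart of Countries")

# pie3D() comes from plotrix, so only call it when the package is available.
if (requireNamespace("plotrix", quietly = TRUE)) {
  library(plotrix)
  pie3D(slices, labels = lbls, explode = 0.1, main = "Pie Chart of Countries")
}
```

Typing install.packages() in the console does the same job as the Install button, and library() does the same job as ticking the package in the list.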
The order of the arguments does not matter. For example, you can write your col here instead; it doesn't matter, as long as you specify the argument name, the keyword.
Now we can also do another plot, a scatter plot. attach(mtcars): what is this? mtcars is a default data set that comes preloaded with R. Some people just want to do a test but do not have a test data set, so R has these preloaded data sets for you to load and do trial and error with. So you attach mtcars. Then you use this library, scatterplot3d. These are all different libraries you may want to use; if one is not in your packages, you install it: scatterplot3d, Install. Okay, so it's here; then you tick it. Then you run the scatter plot.
So what does it say? wt is the weight of the cars in mtcars. The data looks like this: there are all the different car names, Mercedes, Mazda, all the different models, so it's a lot of data. I don't want to explain every column, but you see this plot is 3D. What are the three things? wt is the weight, disp is the displacement, and mpg is the miles per gallon here. So you have three-dimensional data, three things you want to plot at the same time, and you have to do it in 3D. This scatterplot3d will give you the 3D plot. pch you remember, the point character: pch equals 16 is a solid circle. highlight.3d and all this small stuff you can find in the R help, which tells you what arguments the function supports. type "h" means it draws a vertical line for each point. Then you give a title, "3D Scatterplot". So the 3D plot looks like this. Looks nice, right? You can also specify a color there.
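A sketch of the 3D scatter plot on mtcars follows. It uses with() rather than attach(), which is a substitution on my part to keep the column names from lingering in the workspace, and it falls back to a base R pairs plot when scatterplot3d is not installed.

```r
data(mtcars)   # built-in data set that ships with R
print(head(mtcars[, c("wt", "disp", "mpg")]))

if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  library(scatterplot3d)
  # Weight, displacement and miles per gallon on three axes;
  # solid points (pch = 16) with a vertical drop line for each (type = "h").
  with(mtcars,
       scatterplot3d(wt, disp, mpg,
                     pch = 16, highlight.3d = TRUE, type = "h",
                     main = "3D Scatterplot"))
} else {
  # Fallback without the package: all pairwise 2D views of the same columns.
  pairs(mtcars[, c("wt", "disp", "mpg")], pch = 16)
}
```

Typing ?scatterplot3d after loading the package lists the remaining arguments.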
Okay, you have five of them, right? So you have, for example, red, blue. For color you have to make sure it's character. Okay, sure, sure, sure, you can just change it this way, you just copy the hex. Uh, you only have one color; I'm demonstrating with five, right, the pie colors, right? For example, you can see correspondingly you have all your colors right here. So you want to define a color; a color must be defined as character, right? So you can define the colors outside. You can also write this whole string inside here, but my preference is you write it outside, then you call this color here. So col equals color: this col is the argument, and this color, how you define it, is here. Okay, it is character, right, it's a character vector, and each of them is specified by a hex value or a name.
Apart from the names, you can also go, for example, to qualitative, okay, sequential. For example, like this orange: you just copy the hex value and put it here. Okay, five colors together, you run it. See, this orange is changed. This is how you define the colors, but you have to make sure these are corresponding. So this one refers to US; the orders are corresponding. Okay, you want to make these five colors very nice, then you go here. For example, you want to choose a diverging scheme. Okay, they look nice, right, five of them, diverging. This one looks nice, right, this five-color combination. So you can just copy; this is a hex value, and it's the same usage as when you use a name. See, you have the five colors; these five colors are more compatible with each other. So you can choose the colors yourself. The ColorBrewer website is a very good website for choosing colors, because they give you color combinations which look nice when they are put together.
All right, then what else can you do? What about this kind of figure? So this is rnorm; that means
these are using a normal distribution. So rnorm(1000) means you ask for one thousand numbers, and these one thousand numbers will follow a normal distribution; the defaults are zero and one. Okay, so rnorm. Do you remember sample, from session one? Correct, we used sample(100, 10) to sample ten values randomly from the sequence from one to a hundred. But if you want to sample from a distribution, how do you sample from a distribution? So you can use rnorm. rnorm means random numbers from a normal distribution: one thousand of them, zero, one. Zero, one means, you still remember, the normal distribution has a mean and variance, I mean a mean and standard deviation, right? Basic R and basic statistics: the normal distribution has two parameters, one is the mean and one is the standard deviation. So you specify these and it generates data which follow a normal distribution.
For example, you put it into a, and you want to plot it; not plot, uh, hist, a histogram. Okay, what does histogram mean? It means you draw all the bars, right? So does a follow a normal distribution? Yes, it's true: I sampled one thousand points, and when I draw a histogram it really looks like a normal distribution. And how many breaks do you want? Now it has about 10 breaks. You want more? Sure, breaks equal to 20. Okay, R treats breaks as a suggestion, so the number of bars you get may not be exactly 20. With five breaks it looks like this; with 10 breaks, look at this. The more breaks you have, the more it looks like a normal distribution curve. Okay, that's how you visualize it.
By default you do not have to specify the parameters. So this essentially gives you a normal distribution with mean zero and standard deviation one; the defaults are zero and one, so the code does not specify them, but written out in full you would write it this way. So it will generate two normal samples, x and y, and it passes them to hexbin. hexbin means it draws hexagons, and for a data set it will
just plot this. What does it mean? It means the random dots are superimposed on each other; at certain points seven of them are superimposed, or six of them. So these are just basic functions R can do; you can explore all these different ways of drawing figures. Oh, that means, uh, how many points do you have now? A lot of points, right? But somehow that means, uh, 150 points will be considered as one. Okay, so if you have 50 superimposed... but this one you do not need to bother with; this is just a fancy figure, and most of the time you won't use it. Yeah.
Okay, so now you go to your workbench. I uploaded this ggplot2 cheat sheet. ggplot2, as I said just now, is a more advanced package. You can just look at it and follow the arguments for the kind of figure you want to plot. So you can roughly have a look: if I want to plot this kind of figure, what are the things I need to do, what is the input data, what are the arguments? So this ggplot2 sheet will tell you a lot of things: bar plot, uh, trend plot, circle plot, uh, many different types, and this one is a heat map, the different things you may want to plot. So this ggplot2 cheat sheet will tell you all the arguments you need. Okay, so you can just follow this. This is more advanced plotting; what I taught you previously is the basics, um, just to visualize your data.
Uh, what else? Okay, recap. So in session one, what I taught you is the basic syntax of R and how you deal with data. That means, when you are given data in the form of a CSV, what can you do first: how to load the data, how to extract the data, and how to export the data. So in session one I taught you all this. In session two I taught you how to do basic statistical modeling, which is the regression model. A regression model tells you the association between two columns, two vectors, or two different variables. You want to find the association; there are three ways of finding it. And also you need to know that, depending on the
outcome variable, you choose a specific model. Then I also told you about visualization. R can plot beautiful figures. So, uh, still remember, if you want to plot figures in Excel, you have to manually select where to put your legend and the legend size; you have to do all this troublesome stuff, right? But in R it gives you very beautiful figures straight away. Okay, it can take data from your CSV file and directly output figures. And the output figures you can show on your panel here; you can also write them out by using this line and this line: you put your figure-drawing code in between, and it will automatically create a figure file for you, of a good size and a good resolution.
Okay, then that's all for today, thank you. So we have 15 more minutes; you can ask questions, or you are free to talk.
So, um, so by right this should print out a plot, a chart with all the colors? Yes, yes. But it's not happening. It's running but it's not my territory. Hey, what happened to your console? Question? Yep. Okay, I was trying to, um... Do you have your h? Yes, you have. Uh, it should be appearing here. Close it and open it again. I'll try to close it and open it again.
Okay, so I put my email address there. If you have any questions, you can write to me.
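For reference, here is a minimal sketch of the base R steps recapped above: a pie chart with custom colors, writing a figure to a file by placing the plotting code between a device call and dev.off(), and a histogram of rnorm samples. The slice values, labels, hex colors, and file name are illustrative assumptions rather than the lecture's actual data, and png() is only one guess at which graphics device the "two lines" refer to.

```r
# Illustrative data (assumed, not the lecture's values): five slices, five labels
slices <- c(10, 12, 4, 16, 8)
lbls   <- c("US", "UK", "Australia", "Germany", "France")

# One hex color per slice, in the same order as the slices
# (a five-class diverging palette of the kind ColorBrewer offers)
cols <- c("#D7191C", "#FDAE61", "#FFFFBF", "#ABD9E9", "#2C7BB6")

# Open a file device, draw, then close the device to write the file.
# The file name is hypothetical; tempdir() keeps the sketch self-contained.
out_file <- file.path(tempdir(), "pie_example.png")
png(out_file, width = 800, height = 600, res = 120)
pie(slices, labels = lbls, col = cols, main = "Pie Chart of Countries")
dev.off()

# 1000 draws from a normal distribution with mean 0 and standard deviation 1
set.seed(1)  # fixed seed so the sketch is reproducible
a <- rnorm(1000, mean = 0, sd = 1)

# breaks is a suggestion; R chooses nearby round break points
hist(a, breaks = 20, main = "Histogram of rnorm(1000)")
```

Because the arguments are named (labels, col, main), they can be given in any order, which is the point made earlier about argument order not mattering.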