In this course we're going to present data in the form of a flat data set. The most common example of this is a spreadsheet, and you can see an example on the screen now. I'm using Microsoft Excel, but you can also use other spreadsheet software. LibreOffice comes to mind: it is a free and open-source package that can do nearly everything Microsoft Excel can. Here we do have Microsoft Excel, and I want to show you the idea behind what we call a flat file, how we store data in it, and what the philosophies are around doing this.

What you'll notice, of course, is that there are column letters up here: A, B, C, D, E, etc. Down the left-hand side we see what we call row numbers: row 1, row 2, row 3, row 4, row 5. If I were to click on D up here, that highlights the whole of column D. If I were to click on 5 here, that would be the whole of row 5. That gives each and every cell an address. If I clicked on this cell, it would be in row 5 and column D, so its address would be D5. When we start working in software we're usually more concerned with the row value first and then the column value, but that doesn't matter here: the cell has the address column D, row 5.

And here's the philosophy. We place the variable names at the head of each column, in row 1. These are the column headers. In the first column we have a header that we sometimes refer to as ID, and we've already populated a bunch of values here. The first one says study_XR3_1. So this simulates data from some study called XR3, and what we have on each of these rows is the entry for all the variables for a single patient. So this was patient number one, patient number two, patient number three. Now, throughout the courses that I make for healthcare statistics and research, we're going to assume that the patient's identity, and all of the legal issues surrounding it, have been taken care of.
It's different for every country and every region, and everyone should be familiar with the legalities of what can be done inside these files: what data can be kept, what the permissions are, and what the ethical rules are for keeping this data. I'm assuming that is all sorted out. So in this case we have our patients by number. That is their number, and the data behind that number is kept separately, so that we can have access to perhaps their patient folders if we need extra information.

Next we see the age column. Age is a variable, and we can type in an age for this patient. Let's make this patient 35 years of age. Now, age is a numerical variable. It's a continuous numerical variable, and that means we can do certain statistical tests on it.

Referral, for us, means where the patient was referred from, and let's suggest that this patient was referred from what we call the city region. Now, that city region would be a value of a nominal categorical variable. We have three regions in this study: city, inland and coastal. Those are the three regions from which a patient can be referred. They are nominal categorical values: there's no order to them, and they are most definitely not numbers.

Now let's move on to the next variable, and that is temperature. Let's assume this to be the admission temperature, and let's suggest that this patient had an admission temperature of 37.0 degrees. Once again we would see this as a continuous numerical variable. The admission heart rate: let's make that 80. Once again, a continuous numerical variable, and the same goes for the white cell count. Let's suggest that this patient's admission white cell count was 10.1.

Then we want to know whether they have diabetes, yes or no. Let's suggest that this first patient of ours did not have diabetes. The only two data point values that we can enter there are yes and no. There's no natural order to yes and no, so that would be a nominal categorical variable.
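
As a rough illustration of the first row so far, here is a minimal Python sketch that records the values just entered and the variable type given for each column. The column names (`Age`, `Referral`, and so on) and the helper `check_value` are assumptions made for this example; they are not the exact headers in the spreadsheet on screen.

```python
# First patient's values as entered so far in the lecture.
# Column names are illustrative assumptions, not the real spreadsheet headers.
first_patient = {
    "ID": "study_XR3_1",
    "Age": 35,
    "Referral": "City",
    "Temperature": 37.0,
    "HeartRate": 80,
    "WhiteCellCount": 10.1,
    "Diabetes": "No",
}

# Variable type stated for each column in the lecture.
variable_types = {
    "Age": "continuous numerical",
    "Referral": "nominal categorical",
    "Temperature": "continuous numerical",
    "HeartRate": "continuous numerical",
    "WhiteCellCount": "continuous numerical",
    "Diabetes": "nominal categorical",
}

# Permitted data point values for the categorical columns.
allowed_values = {
    "Referral": {"City", "Inland", "Coastal"},
    "Diabetes": {"Yes", "No"},
}


def check_value(column, value):
    """Return True if the value fits the type declared for its column."""
    if variable_types[column] == "continuous numerical":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return value in allowed_values[column]


results = {col: check_value(col, first_patient[col]) for col in variable_types}
print(results)
```

Every entry in `results` should come out as `True` for this row, whereas something like `check_value("Age", "35 years")` would be rejected because it is a word rather than a number.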
What grade of disease do they have? Let's assume that the disease we are studying here comes in four grades, and let's suggest that this patient had grade two disease. Now what is that grade, one, two, three, four? Is that numerical? I would suggest no. I would suggest that this is ordinal categorical. The reason it's not numerical is that there isn't a fixed difference between the data point values one, two, three and four. I cannot say that someone who has grade four disease is twice as badly off as someone with grade two disease because two times two is four, or that the difference between grade one and grade two is exactly the same as the difference between grade three and grade four. Those statements have no meaning. So the numbers are actually just placeholders. I could have put A, B, C and D there. They are not real numbers. They are not numerical values. They do have some natural order to them, though. We would accept that there is a natural order between someone who's got grade one, grade two, grade three or grade four disease, and so we would call that an ordinal categorical variable.

There were three groups of treatment, and there was a placebo group. Let's suggest that the first patient was in the placebo group, which we code as P in our example here. And then the outcome. The outcome is whether they survived or did not survive. For the moment I'm just going to code this as: this patient survived. And lastly there is months. That would be for us, again, a numerical variable: how many months until they were last seen? Let's suggest that this patient was last seen 20 months after they were randomized into the treatment group, and they had survived. That's the last time we saw them, which means they might have died at month 21, but that was the end of the study, so we do not know.

So as a recap, these columns are all variables.
The variables contain data point values, and those data point values are all of a certain type. Here we have continuous numerical, then nominal categorical, then continuous numerical, continuous numerical, continuous numerical, and this is nominal categorical again, then ordinal categorical, nominal categorical, nominal categorical and again continuous numerical. And that is what we would do. We would fill in a row for each patient. So that would be patient number one. That would be patient number two. And this is what we refer to as a flat file. We have columns, which represent our variables, and rows, each of which represents a subject in our study.

What we want is for all the values down a variable to be of the same type. I should not type in "20 year" for the next patient. That would just be a word. We want to keep this at 20 for the next patient, and only choose from the data point values for that specific variable. Likewise for referral, it'll be city, coastal or inland. So all of these should be either city, coastal or inland, all the way down. So we keep the data point values down these columns of exactly the same kind, and each row is one subject, or one patient, in our data set. And this is the data set that we would then import into our software to do our statistical analysis.

In the next video, I'm going to show you a very easy way to create some simulated data right inside Excel, or any spreadsheet software for that matter, so that you have something you can work with and use as data for your exercises.
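
To make the idea of keeping each column consistent a little more concrete, here is a minimal sketch in Python of what a check might look like once the flat file has been exported as comma-separated text. It is only an illustration under stated assumptions: the header names, the second patient's row and the helper `find_problems` are all hypothetical, and the second row deliberately contains the "20 year" mistake described above.

```python
import csv
import io

# A tiny flat file: row 1 holds the variable names, every later row is one patient.
# Headers and the second row are hypothetical, made up for this illustration.
flat_file = """ID,Age,Referral,Grade
study_XR3_1,35,City,2
study_XR3_2,20 year,Coastal,3
"""

REFERRALS = {"City", "Inland", "Coastal"}  # nominal: no order
GRADES = ["1", "2", "3", "4"]              # ordinal: labels kept in their natural order


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_problems(text):
    """Return (spreadsheet_row, column, value) for each value that breaks its column's rule."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Start counting at 2 because spreadsheet row 1 is the header row.
    for row_number, row in enumerate(reader, start=2):
        if not is_number(row["Age"]):
            problems.append((row_number, "Age", row["Age"]))
        if row["Referral"] not in REFERRALS:
            problems.append((row_number, "Referral", row["Referral"]))
        if row["Grade"] not in GRADES:
            problems.append((row_number, "Grade", row["Grade"]))
    return problems


problems = find_problems(flat_file)
print(problems)  # [(3, 'Age', '20 year')]
```

The grades are stored as labels in an ordered list rather than as numbers, which mirrors the point that they have an order but no arithmetic meaning.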