Let's look into research in private practice. I'm doing this video lecture series to entice those interested in doing some research, but also to help those who read medical journals better understand what they are reading and make more informed decisions when analyzing those articles. In this first lecture we look at the types of research. As opposed to a normal introductory series on medical statistics, where we would look at median, mean, mode, standard deviation and terms like those, we are going to start with the p-value, probably the most important concept in medical literature. So in this talk we'll cover what medical research is all about and why it should be done, the types of research that you can embark upon, and some introductory statistics: the p-value.

So, the why of medical research. Why give a lecture series like this, and why think about it at all? Well, there is much more to medicine than just service delivery. We sit in our practices, we sit in our rooms and we see patients, but there are questions that need to be answered. There is critical thinking. There is also the need to audit your practice. It is very good practice to look at the outcomes of your management plans, of the ways that you treat patients. You have to look at your own outcomes. If you don't do your own bit of personal research, you're not being a proper healthcare provider. As for statistics, if you have some knowledge of statistics and of research, you can critically read a paper. You don't just have to follow others' lead or the opinion of others. You can critically look at that research paper, decide whether it was done properly, and decide what to take from it.
You don't just have to read the conclusion in the abstract. You can dissect the article and take something proper from it.

What about private practice specifically? Well, you sit on an enormous amount of data in private practice. Your files are filled with reams and reams of data that could fill many hard drives. It is sitting there, and it is probably of enormous value, locked away in a file in a cabinet. Now consider where most research is done. If you open any local research journal, most of the articles will be from academia, from some academic institution, and not from the private sector at all. Look, though, at what private practice has to offer versus academia. Usually academia is a setting where there are severe financial, resource and personnel constraints. In many situations there is no access to modern first-world medical care, yet that is where most research is done. In private practice, most doctors have a first-world setting with access to the most modern, best forms of care. Data from that setting is absolutely important, and so is the research done on it. Why aren't medical journals filled with it? Lastly, in your private practice you have questions. You have seen so many patients with a certain condition or in certain situations, and you often sit by yourself, maybe late at night, maybe watching TV, maybe before falling asleep, waking up in the morning or taking a shower, and there is something that bugs you about a certain condition. Why not research it?

So if you want to embark on that, or just engage more properly with journal articles when reading them, think about the different types of studies you can do. There are the observational studies, there are the clinical trials, and then "other", of which there are usually two; we'll get to those. Observational studies come in four types: case series, case-control, cross-sectional and cohort studies.
Now you can read definitions of any of these and they will all be slightly different, the reason being that these are not clear-cut divisions. You can have aspects of one type of observational study inside another. Whichever way you put it together, though, observational means that you are observing some trait. You are not necessarily engaging the patient in a certain management. Those we'll get to later.

So let's look at a case series. It's a very simple descriptive account of a characteristic. Sometimes it's also known as a clinical audit. You have a defined time period and, from start to finish, you record some data on a patient set. Perhaps there is an epidemic going around, and during that time you record some aspect of the patients seen, the way that they present, and just report on those. That's a case series. It is of enormous help for future planning. You can really use a case series to critically audit your practice and therefore be able to plan for the future.

Let's look at case-control as a second form of observational study. That starts with the presence or absence of an outcome, and then we look backwards. Don't get confused here between retrospective and prospective; that's not exactly what we're talking about. You start with the presence of an outcome, a certain disease, and you look backwards. You compare subjects with and without the outcome, and then you look for risk factors and characteristics that differ between the two groups. Prospective and retrospective simply refer to how the data was created. You might say that from today I am going to have a form that I fill in for certain patients. You don't necessarily have to have a plan in mind with this, but there are certain pieces of information that will always be captured, and that goes into your files.
Years later you might decide that you want to do a case-control study, and you're going to look backwards at patients who have already been treated. That data, though, was actually collected prospectively: you decided beforehand what data was going to be collected. Retrospective just means I'm going to try and find the data points that I'm interested in from the file, so that data was never designed to be collected in a certain way. That's retrospective. In any case, with case-control studies you're looking backwards and comparing two or more groups to see if they differ in some way. An example would be that you give the same antibiotic for the same disease and you have a group of responders and a group of non-responders. You have numerous data points that you're collecting, white cell counts, types of cultures, whatever, to see whether there is a difference between the groups: can I identify characteristics or even risk factors that are different between the two groups?

Let's move to cross-sectional studies as a form of observational study. Here the observation is made at a particular point in time. It's not stretched over a long period. A good example of this is a survey. You might design a survey regarding a specific disease or management and hand it out to your patients. They can form part of that study, and you might test their knowledge of or their attitude towards a certain healthcare topic. A cross-sectional study can also form part of many of the other study types. So, as I say, don't see these as clear-cut, completely distinct types. The last one, the cohort study, can also be involved in these.

Let's look at the definition of a cohort. A cohort is a group of individuals with a common trait who remain part of that group over an extended period of time. So these are lengthy studies, usually to look at the risk factors for common diseases like diabetes or hypertension. Think about the Framingham studies: there is so much data in those study sets, and those patients formed a cohort.
The second type is a trial, a clinical trial. Here we're going to have groups of patients and we're actually going to do something to them: operate on them, give them a medication, do some investigation on them. In these clinical trials you have to have controls. Patients can be their own controls: you can have a data set before and after an intervention. We can have external controls, and these might be of a historic type. There might be a historic cohort that you can compare with your patient set, who looked exactly the same initially but had some form of intervention. And then there's the gold standard, the randomized controlled trial, where a group of people with the same characteristics is taken and randomly allocated to have an intervention or not, and then those groups are compared to each other.

Then the "other" that I was referring to: meta-analyses and reviews. A meta-analysis just groups together similar articles that looked at the same problem, so you get an enlargement of your numbers. Modern literature really loves meta-analysis, but there's one large problem with it: it only looks at articles that have actually been published. The problem we deal with in medical research is that there might be an enormous amount of data that no one ever publishes because the outcome was negative or there was no difference between groups. That never, or hardly ever, gets published. It's usually only when there are large differences, statistically significant p-values, that these things get published. So we'll always have this bias, but that's something inherent to science and we have to live with it. And then lastly reviews, where someone purely looks at the known literature on a specific subject. These make for excellent articles when you want in-depth new knowledge on a topic.

So let's move on to the p-value. We're going to discuss software and the roll of some dice.
We're going to look at different forms of data, most specifically continuous data, and then lastly some combinatorics: combinations and the central limit theorem. This is quite a dangerous slide because it might scare some people off. Relax, though; we'll make it quite easy.

Let's look at the software, Microsoft Excel. Most of you will have Microsoft Office installed and most of you will be quite familiar with Word, but there's also a second piece of software, usually right next to it: Microsoft Excel, a spreadsheet program. If you're used to Microsoft Word, jumping to Excel is not a big difference. Now I have to warn you that there's a difference between the Microsoft version and the Apple version. The Microsoft Windows Excel, I should say, has a built-in statistical ability which is omitted in the Macintosh version. In the Macintosh version you actually have to download a piece of free software called StatPlus, from AnalystSoft, that will do exactly the same statistical analysis as the Windows version. Having said that, most of you will have this program on your laptop and you've never even opened it, not knowing that you can capture data in it and that at the press of a few buttons it will give beautiful p-values for you. I'm going to pause a moment and we'll go to a short demonstration of how this can work.

Here then we have Microsoft Excel, the Mac version. Very simply, you'll see these cells. I can just highlight one of them and start typing a name. Let's call it group A. I can hit the tab button or the arrow keys to jump to the next cell, and I can just move around in them. So, for instance, just below it I can click in a cell, type a value of 110 and hit enter or tab. The arrow keys will take you in different directions, and I can type the next value. Now look at the left-hand side here. I have my two groups of patients. I've named them group 1 and group 2, and for each patient, and I seem to have 31 patients in each group, I can simply type the values.
Let's consider these as some liver function test value. So those were the values of 31 patients in group 1 and 31 patients in group 2. I can now very simply launch my statistical add-on called StatPlus, and here it's added; you'll see it at the top. I can simply go to Statistics and look at all these statistical tests. Some of them are not available in this free version; you'll actually have to pay for the professional version. But most of what we want to accomplish here we can do with the free version, and all of these, as I say, are already included in the Windows version of Excel. So I want basic statistics and I want to compare the means of two groups: a t-test. This dialog comes up. I have my group 1 and I have my group 2. I'm going to click here, and then highlight the values to tell it where they are. So I click there, it appears, and I just click on the first value, hold the click and drag it all the way down. Now I've highlighted group 1. I bring StatPlus back and it has referenced all those values. Now I go to my second group and highlight all of them: click and drag. Bring back StatPlus and there they are. I'm also telling the software that the first row, where it says group 1 and group 2, contains labels, so it mustn't see that first row as values, only everything under it. I don't have to fill in any of the rest. I'm going to choose an alpha level, that is my p-value cutoff, of 0.05 or 5%. There are different types of t-tests; let's not be too concerned about these now. All I have to do is hit OK, wait a second or two, and lo and behold, there's a sheet with all my answers. Let's just say View, Zoom, and zoom in a bit so we can have a look. Beautiful statistics there. It tells me the size of each of my samples, 31 patients, the mean of each of these groups, and the variance, which is the square of the standard deviation.
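For anyone who prefers to see what the add-on is doing, the comparison of two means above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the StatPlus or Excel implementation: it assumes an equal-variance (pooled) two-sample t-test, which is only one of the t-test types mentioned, and it finds the two-tailed p-value by numerically adding up the area under Student's t curve with Simpson's rule. The function names and the two small groups of values are invented purely for illustration.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Height of Student's t curve with df degrees of freedom at x."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=10_000):
    """Two-tailed p-value: area under the t curve beyond -|t| and +|t|."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * area * h / 3  # area between -|t| and +|t|
    return max(0.0, 1.0 - central)


def pooled_t_test(group_1, group_2):
    """Equal-variance two-sample t-test; returns (t, degrees of freedom, p)."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / df
    t_stat = (mean(group_1) - mean(group_2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t_stat, df, two_tailed_p(t_stat, df)


if __name__ == "__main__":
    # Hypothetical values, invented for this sketch (not the lecture's data).
    group_1 = [110, 98, 105, 120, 101, 115, 99, 108]
    group_2 = [95, 102, 90, 99, 88, 104, 93, 97]
    t_stat, df, p = pooled_t_test(group_1, group_2)
    print(f"t = {t_stat:.3f}, df = {df}, two-tailed p = {p:.4f}")
```

The output mirrors what the spreadsheet sheet reports: the test statistic, and a p-value to compare against the chosen cutoff of 0.05.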
Don't worry about the variance for now. And there I have a p-value of 0.03: a statistically significant difference between the two groups. Just to remind you, there are two different types, a two-tailed and a one-tailed p-value. A bit later in this video I'll tell you what the difference is. But this is so empowering. You can go to your practice right now, make two groups of patients, type in the values under group 1 and under group 2, hit a few buttons and voila, out jumps a p-value for you. That's beautiful.

Now we're going to move on to the roll of a die. Now that I've shown you how to easily get a p-value, you've got to understand what this p-value really means, and we're going to start that off with the roll of two dice. It's a difficult slide to see, but what I want to point out to you is that we have die number 1 and die number 2. Remember, a die only has values of 1 to 6. It cannot land in any other way. Let's leave out those cases where it lands on an edge in some bizarre setup. It can land face up with values of 1 to 6. So look at die 1 and die 2. I can throw a 1 and a 1, a 1 and a 2, a 1 and a 3, a 1 and a 4, a 1 and a 5, a 1 and a 6. I can throw a 2 and a 1, a 2 and a 2, et cetera. What I'm interested in here is just the total. If I throw a 1 and a 1, it totals 2. If I throw a 2 and a 4, it totals 6. So under Total you'll see that you can throw values of 2, 3, 4, 5, 6, 7, then 3, 4, 5, 6, 7, 8, then 4, 5, 6, 7, 8, 9, et cetera, up to 12. In the next column over you'll note the frequency distribution. So how many ways are there to roll a total of 2? Well, there's only one combination that can give me a 2, and that's if I throw a 1 and a 1. But if I go down to a total of 7, there are actually 6 possible combinations that can give me a 7. So it's much more likely for me to throw a 7. Now if I look at this data set, it encompasses all possibilities and probabilities. There's no way I can throw a 1 or a 13 or a 27.
So everything possible is in this data set of mine. Whatever I throw with the dice, I'm going to land up with a total between 2 and 12. Nothing else is possible. So I can work out what the probability of each total is. In all we have 36 combinations, and 6 of those combinations will give me a 7. That means 16.67% of the time we will get a 7. In 2.78% of cases I will get a total of 2. That's very easy to conceptualize. If I add up all those probabilities it should come to 100%, and indeed there you can see the total is 100%. So I can work out beforehand what the probability of each total is as I roll these dice. Now remember it says 100% there, or 2.78%. A percentage means you've multiplied something by 100, so the value is really between 0 and 1, 0 being 0% and 1 being 100%. That's how we see probability, and that's the p-value. So a p of 0.05 actually means 5%. On the first line there, a total of 2 has a frequency of just one, so that will be 0.0278. That is the probability of throwing a 2. The probability of totaling a 7 is 0.1667. Now think about that. If you were to roll two 6's to get 12, the probability of doing that, the p-value of getting that result, is actually 0.0278. Less than 0.05. In that sense it's actually statistically significant when you get either double 1's or double 6's; quite interesting.

So let's look at this graphically. I have now made little columns, and the column height represents the probability of getting that value. You can clearly see it's much easier to total a 7 than it is a 2 or a 12.
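The dice table described above is small enough to rebuild by brute force. The following is a minimal sketch in Python (the function name is my own, not something from the slides) that lists all 36 combinations of two fair dice, counts how often each total occurs, and turns those counts into probabilities between 0 and 1.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Return {total: probability} for the sum of two fair six-sided dice."""
    # Enumerate all 36 equally likely (die 1, die 2) combinations.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    n_outcomes = sum(counts.values())
    return {total: Fraction(count, n_outcomes) for total, count in sorted(counts.items())}


distribution = two_dice_distribution()

if __name__ == "__main__":
    for total, prob in distribution.items():
        print(f"total {total:2d}: {float(prob):.4f} ({float(prob):.2%})")
    print("sum of all probabilities:", sum(distribution.values()))
```

Using exact fractions avoids rounding questions: a total of 2 or 12 has probability 1/36 (about 0.0278), a total of 7 has probability 6/36 (about 0.1667), and all eleven probabilities add up to exactly 1.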
There is one important thing, and some others as well, but one important thing I want you to take from this. If I were to ask you what the probability is of rolling a 7, you look at that central column. It has a height of 16.67 or, in actual fact, remember, that's 0.1667, and you take its width as just being a value of 1. Think back to the equation for the area of a rectangle: height times width. So if the width is 1 and the height is 0.1667, the area of that column gives me the probability. I can read it from a graph: the geometrical area, the surface area of that little bar, is equal to the probability. All the probabilities total 100%; all possibilities are included in that 0 to 1. So if I were to total the surface area of all these little columns, I would get to 100% or 1.0. I can now use surface area to equate to probability. Very important.

Now, though, let's look at continuous data. What if the data is not discrete? What do I mean by discrete? That's why I bring in this little slide here. Discrete refers to the fact that nothing in between is possible. I can roll a 1, a 2, a 3 on each specific die; I cannot roll a 1.3456794. It's impossible. The values come in discrete steps. Strictly speaking, if you were to look at the number of red blood cell units transfused, we usually transfuse these as full units, not as 1.345 units. If I looked at the total milliliters, maybe I could see that as continuous. You have to understand the conceptual difference between discrete and continuous, and you'll see at the bottom the forms of data. Now, the rolling of the dice was most definitely discrete, and if it's discrete I can see the width of my column as having a value of 1. But what if I have continuous values? What if I look, for instance, at white cell count? Of course you don't get a fraction of a white cell, but we do express white cell count or Hb in such values that they are
for all intents and purposes continuous data sets. Take that for granted.

Very quickly, though I am digressing, look at the two forms of data. We get categorical data and numerical data. Categorical data comes in two types, nominal and ordinal. Nominal would be: name a few types of cars. If we were to walk outside and look at everyone's cars, there would be BMWs, Volkswagens, Peugeots. I cannot order them in any way. I cannot say that Volkswagen comes before Peugeot. I might be able to say that someone likes one car better than another, but I can't really put them in any kind of proper order. Ordinal categorical data I can order. I can say, please mark on this little smiley face chart how much you enjoyed this restaurant, from one smiley face to five smiley faces. Someone might mark four smiley faces and another person might mark two. There is some order to it: four smiley faces means you liked that restaurant more than someone who marked two. But I cannot say I liked the restaurant twice as much as someone else if I marked four and they marked two. There is no real numerical difference, but there is some order to that data set.

Numerical is different; in other words, it's numbers: interval and ratio, the only difference being the presence of a true zero. Degrees Celsius, for instance, is interval, because although we have zero degrees Celsius, it's not a true zero. So if the temperature outside is 10 degrees Celsius and then 20 degrees, I cannot really say it is now twice as warm, because there is no real zero. The Kelvin temperature scale has an absolute zero, so there I could say 200 kelvin is twice as hot as 100 kelvin, but you have to convert to that scale. Anything else, white cell count for instance: there is an absolute zero. There can be a zero white cell count or a zero Hb, so these would be ratio data. You just have to be careful: there are certain mathematical equations and terms for which you cannot use interval data, only ratio data. So, a bit of information you probably don't
need in your everyday life. What is important here is discrete versus continuous data.

And here we go: here is a continuous data set. Now, how do I go from being able to use the surface area of my little bars, which have a width of one, to using this continuous data set, with geometrical area representing a p-value? Well, integral calculus actually comes to the rescue, because we can determine the area under a curve using integration. You don't have to worry about it, though; we won't go into any of this. What is important to note here is that I don't have a tiny little base with a value of one and a tiny little height. That does not exist. There is no specific width. So what we have to do here is to say: between a certain x-axis value and another value, let's determine the area under the curve. So it is not an absolute value as to how high the curve goes at any one point; it is the value of the surface area between two points on the x-axis.

For us to better understand this, we need to know something about combinations and the central limit theorem, so let's have a look at this. This is back to our friend the spreadsheet software, and you see the function there, =COMBIN, which stands for combinations. Now think about this. If I had ten patients and I wanted to make groups of two, how many groups can I make? Well, it is easy in the spreadsheet software. I can type =COMBIN(10,2), hit enter, and it will tell me 45. Think about that: how many little groups of two I can make. Now, combinations says that if I choose John and Sally, and the next time I choose Sally and John, there is no difference; that would be the same combination. Permutations are different. If I choose Sally and then John, and John and then Sally, those are two different permutations. So we talk about combinations. But it is fantastic: with ten people I can make 45 separate groups of two. If I had to make groups of four, I can make 210 completely different groups of four people, just from ten
people. Now look how that suddenly escalates. If I had a hundred patients and I were to make groups of ten, I have 1.7 times ten to the power 13 different combinations that I can draw from that.

So why is this important? It is important for the following reason. If you were to include 50 patients in your research study, let's say 50 patients with a urinary tract infection, how many people on the face of this planet can suffer from urinary tract infections? Millions. There are about 7 billion people now; let's say the number at risk would be, and let's round things down, and it's really rounding down, a million. From those you take 50 for your study. So from a million people, making groups of 50, can you imagine how many different combinations there could be that you could have included in your research? Set aside the fact that you live in Cape Town, in a specific geographical area; we are going to consider everyone in the world. Your little group of 50 is one tiny, tiny, tiny example of the countless possibilities that could have made up that study. If you had decided to start that study a month later, your 50 patients would have been different; you would have included a different set of patients. So they are countless, and this is where the beauty of statistics lies: the beauty of the p-value, the beauty of the central limit theorem.

And there we have it in one graph. So what does this represent? It doesn't represent all the individual values. What we're representing here is a bell curve, and that's what the central limit theorem says. If there were a million patients, and your little group of 50 was chosen from amongst almost countless different combinations, and let's say, for instance, you're just comparing two groups to each other, then the difference in some characteristic between your two groups would have been one single data point on this graph. This graph represents all of those different possible studies that could have been done. Now remember, from 10 patients in groups of 4 I can
already get over 200 combinations, so imagine how many there really are. Your study forms a tiny little blip on that. And here is what the central limit theorem says. If by some very unnatural process we could have done all of those combinations, all of those countless different combinations from those million patients, choosing this 50, that 50, a different 50, and a different 50 again, then in each of those studies we would have found, for some characteristic, a difference between two groups. And if I could plot all of those differences, it would almost always, let's say always, land up as this bell curve.

If I set the total area under that whole curve as one, so the area there is 1 or 100%, I can now say, well, the difference that I found must lie somewhere on that curve. Suppose the difference between my two groups is so rare that not a lot of studies would have found it, and it would have been much more common to find some other difference (that would be the hump in the middle) than the one I found. That would mean your little value falls here, on the outside extremes. We calculate the area under the curve from where your line is, your finding, towards the outer edge; that's what we normally do. If that area represents less than 0.05, you have a statistically significant difference. That's phenomenal. That is just mind-blowing. Now, we made an arbitrary decision. We decided that if the area under the curve was less than 5%, I would see that as statistically significant. You could also have chosen 0.01 or 1%. It does represent the area under the curve, if the total area under the curve is 1 or 100%.

I'm going to repeat that. Your study of 50 patients is just one tiny possibility among countless others you could have done by choosing at random 50 other patients. If everyone could do that, if all of those different combinations could have been involved in the study you did, one with 50 and 50 and 50, and you plotted the
differences, it would be much more common to find a certain difference, very common, and those would be under the hump. If you found a difference that wouldn't normally happen, it would be very rare in amongst all of those. Remember, when we rolled the dice it was rare to get a double 1 or a double 6. If you don't often find such a difference as the one you found, your value falls under those narrow bits, the area under the curve would thus be small, and you could say: I found a statistically significant difference. That is mind-blowing.

So I bring you back to one consideration you must have, and it's critical when you read a journal article. Look at that: you have a two-tailed distribution p-value and a one-tailed distribution p-value. One is 0.06, and if we chose 0.05 as our cutoff value, that's not statistically significant. But the one-tailed distribution gives me 0.03, and that's a statistically significant difference. So if I were an unscrupulous investigator, I would choose the second p-value there, because that will make my paper look better. That's not how it works in research. You have to have chosen between those two beforehand. And that's how you can read a journal article from now on: read how the authors constructed that paper, read the methods section, and see if you can pick up where they might have cheated. I'm not saying anyone does, but you can look for it.

So how does it work? You have to set a hypothesis beforehand. It's called the null hypothesis. The null hypothesis might say there's no difference, and the test hypothesis says one of three things: the value of one group will be less than that of the other group, will be not equal to it, or will be more than it. These are three distinct things. Look at this graph. These are one-tailed tests. What this test hypothesis said, before embarking on data collection, was that my one group will have a value higher than the other. I choose 5% or 0.05 as my cutoff value, but that means I have to group all my area under the
curve not on both sides but just on the one side. Here it's on the extreme right, so that would be "more than". If it was on the left-hand side, it would have been "less than". The first graph I showed, where there were pink areas on both sides, is where the test hypothesis just said it will be "different from". So let's say, for instance, you're doing some test and you're looking at white cell count. The one group will have a white cell count of 12, and you're just saying that the other group will have a value different from 12. That's a two-tailed test. If my test hypothesis said the test group will have a value of more than 12, this becomes one-tailed, or less than 12, this also becomes one-tailed, and I can actually use that one-tailed p-value. But I'm grouping that whole 5% under one side of the curve. I don't split it into two and a half percent on one side and two and a half on the other.

So what the software actually does, as you'll see from the thin little green line there, is work out a value, and for that value we look towards the edge, further away from that line, at what the area under the curve is. On the left-hand graph you'll see that this denotes statistical significance. The area under the curve of that little green bit is less than 5%, because the pink represented the 5%. That is a statistically significant difference. But I use geometrical area; I use the surface area of that funny little shape there to determine whether things are statistically significantly different or not. On the right-hand side, though, you see the line that was calculated is to the left, and the green area under the curve is actually more than 5%, or more than 0.05, so that finding would not have been statistically significant.

Some final thoughts. You have questions in your daily practice, questions in your mind. Think about all the data that you sit on. Think about embarking on a bit of research and publishing it. If you have any questions, please give me a call or send me an email. Thank you.
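As a footnote to the one-tailed versus two-tailed discussion above, here is a minimal sketch in Python of how the same finding gives two different areas under the curve. It assumes a standard normal bell curve rather than the t distribution a spreadsheet add-on would typically use, and the standardized difference of 1.88 is a hypothetical value picked only because it reproduces p-values close to the 0.03 and 0.06 on the slide. The function name and the `alternative` labels are my own.

```python
from statistics import NormalDist

# Bell curve with total area 1 (assumption: standard normal, not Student's t).
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def p_value(z, alternative="two-sided"):
    """Area under the bell curve that is at least as extreme as z."""
    if alternative == "greater":    # test hypothesis: group value is more than
        return 1.0 - STANDARD_NORMAL.cdf(z)
    if alternative == "less":       # test hypothesis: group value is less than
        return STANDARD_NORMAL.cdf(z)
    if alternative == "two-sided":  # test hypothesis: group value is different
        return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


if __name__ == "__main__":
    z = 1.88  # hypothetical standardized difference between two groups
    print(f"one-tailed (greater): {p_value(z, 'greater'):.3f}")
    print(f"two-tailed:           {p_value(z):.3f}")
    # Cutoff lines for alpha = 0.05: all 5% on one side, or 2.5% on each side.
    print(f"one-tailed cutoff: {STANDARD_NORMAL.inv_cdf(0.95):.3f}")
    print(f"two-tailed cutoff: {STANDARD_NORMAL.inv_cdf(0.975):.3f}")
```

The one-tailed cutoff sits closer to the middle of the curve than the two-tailed one, which is why the same finding can pass one test and fail the other, and why the choice has to be made before the data is collected.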