Hey everyone, welcome back. It's Professor Howard. So we're moving into topic number five, which is one of my favorite topics in all of behavioral analysis. Keep in mind that so far we've been talking about developing a behavioral definition, choosing a method of observation, and figuring out which approach to observation will maximize the quality of the data we're going to receive. And when we're talking about having a solid intervention, you want to take into account the quality of your observations. You want to make sure that you're choosing the correct goal and that you're meeting the needs of your client. And then finally, we want to talk about this idea of fidelity. Now notice that in your reading, there's nothing about fidelity. That's because this is a relatively newer idea. At the time Keith Miller was writing his textbook, there wasn't a lot of effort being put into making sure that behavioral interventions were being used as they were intended. At that point, we were really just trying to discover what principles worked best and under what circumstances. Now we understand that it's critical that if you're going to use a procedure, you use it well; if not, you're not going to achieve the results that you need for your client. So this is a relatively short lecture for this week, but understand that these topics are some of the most important in the entire field. Now I want to talk first about reliability. Notice that by reliability, what we mean is the extent to which two or more trained observers agree that our behavioral phenomenon is occurring, right? This is a science. We need to have confidence in our data, data, data, data, data. Actually, this would be a great time for you to start that coffee drinking game. Every time I say the word data, take a sip of your coffee.
We have to have confidence in those data so that we are sure that what we're observing is the phenomenon of interest and that we're accurate, at least to the extent that we can calibrate two humans to see the same thing. So in order to make this happen, you want to compare those two trained observers. We look at the extent to which they agree with one another, and then we divide by the number of opportunities they had to agree with one another. Miller's definition is the number of agreements divided by agreements plus disagreements, multiplied by 100. So imagine that you and I, two trained observers, are watching a client for 10 sessions. You and I agree that the behavior happened in nine of those sessions, and we disagree in one session. So it's the number of times we agreed, nine, divided by the number of times we could have agreed, agreements plus disagreements, which is 10. Nine divided by 10 is 0.9. We multiply that by 100 to create a percentage: 90% agreement. Now, in the past, reliability observations were conducted at least once per experimental condition, and of course we haven't yet gotten to talking about what conditions are, but bear in mind that it was performed pretty infrequently. More contemporary work says you should be conducting these observations about a third of the time. So every three sessions, you really should have someone introduce some quality evaluation. And Miller makes a distinction. Miller has a rule of thumb, a heuristic, about a new definition versus an old definition. Miller says that if you're developing a brand new behavioral definition, a brand new intervention for a client, then 80% is pretty acceptable reliability, or interobserver agreement, because bear in mind it's new. You're going to have to do some work with it.
It's probably less refined and you haven't really discovered all the things that are going to challenge that definition yet. But once you've used that definition, once you've done a study or an intervention and you know that definition has merit and you've refined it and gotten it pretty strong, with an old definition, one that you've used before, you've got to step up your game and make it a little bit better. So that's why Miller says with a new definition, give yourself some slack: 80% or better is usually the benchmark we're aiming for. If it's an older definition, 90% is the benchmark. But in practice, higher is better, up to a point. Higher is better up until you start seeing a person claim that they have 100% reliability, and look, human observers are flawed. We're human. So if a person tells me that they had 100% reliability across all of their sessions, I'm a little bit skeptical. The whole point of reliability is to make sure that our human observers are doing their jobs right, that they're using the behavioral definition as they've been trained to, and that they're accurately capturing those data, at least to the extent that they agree with each other. We can never be certain that they actually recorded the behavior perfectly. We can only see the extent to which they agree with the other trained observer. But if they tell me that they had 100% reliability across all of their observations, and there were lots of observations, and they never made a mistake, I'm skeptical. As a rule of thumb, higher is better. Remember, to observe reliability, Miller says you need to be observing the same behavior at the same time using the same behavioral definition. When we say the same behavior, that means you're both looking at the same client, that you and observer number two are making sure that you're looking at the same person and recording the same behavior of interest.
If you have a client with multiple skills and I come in thinking, okay, I'm going to do reliability on social interaction with their colleague, and someone else is looking at the extent to which they initiate some chore, then we're not looking at the same behavior. You also want to make sure that you're looking at the same time. Reliability observations must have two trained observers looking at the same behavior at the same time. If you record on Monday and I record on Tuesday and then we try to compare our observations, we just can't, because we weren't both looking at the same thing. At that point, we're just collecting data; we're not doing reliability. And then finally, we have to make sure that we're using the same behavioral definition. Implicit within that, again, is that we're looking at the same behavior, but the same behavioral definition requires not only that we're looking at the same response; there's also an element of training there. Are both of your observers actually trained to use the behavioral definition? Did you invest enough to make sure that you're getting really good data? Because we need those good data. Okay, there are two different ways in which we conduct reliability. When we're talking about time-based measurements, we typically use something called point-by-point reliability. What I'm showing you on screen here is a table. In that table, we have two observers, one on top and one on the bottom, and I'm showing you with ones and zeros whether those observers said the behavior happened or didn't happen. In the first interval, observer one said it happened and observer two also said it happened, which means we have an agreement. In the second interval, both observers said the behavior did not happen. So even though they're both saying no, the observers agreed with one another insofar as they marked the same thing, right?
So they either have to both say it happened or both say it didn't happen to have an agreement. In the third interval, one observer said the behavior happened while the other did not, so we have a disagreement there. In interval four, one observer said it happened and the other did not: a disagreement. In interval five, both observers said the behavior did not happen, so we have an agreement. In interval six, they both said the behavior happened: agreement. In interval seven, both observers said it didn't happen: an agreement. In interval eight, one said it did happen and the other said it did not: a disagreement. In interval nine, both said it did not happen, so it's an agreement. And in interval ten, both observers said the behavior happened, so we have an agreement. Then we count this out mathematically. How many times did they agree? In intervals one, two, five, six, seven, nine, and ten, so seven instances of agreement. And how many times did they disagree, where one observer said the behavior was happening and the other did not? In intervals three, four, and eight. So we have seven intervals of agreement and three intervals of disagreement. We calculate that by taking the intervals of agreement, seven, and dividing by agreements plus disagreements, or seven divided by seven plus three. Seven over 10 is 0.7, and we multiply that by 100 to get their reliability, which is 70%. And Miller would say that is unacceptable; it's not enough agreement. Bear in mind, point-by-point reliability is typically used for interval and time sample recording. For event recording and outcome recording, we would use something called total count reliability. This is where, for instance, we take the total that one person counted and divide it by the total that the other person counted.
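To make the arithmetic concrete, here is a short Python sketch of both calculations. The function and variable names are my own, not from the textbook; the data are the example values from this lecture.

```python
def point_by_point(obs1, obs2):
    """Percent agreement for interval or time sample data (1 = occurred, 0 = did not)."""
    agreements = sum(1 for a, b in zip(obs1, obs2) if a == b)
    disagreements = len(obs1) - agreements
    return 100 * agreements / (agreements + disagreements)


def total_count(count1, count2):
    """Percent agreement for event or outcome recording: smaller total over larger."""
    return 100 * min(count1, count2) / max(count1, count2)


# Interval data from the worked example: agreements in intervals
# 1, 2, 5, 6, 7, 9, and 10; disagreements in intervals 3, 4, and 8.
observer_1 = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
observer_2 = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
print(point_by_point(observer_1, observer_2))  # 70.0

# Event recording totals from the example: one observer counted 10, the other 12.
print(round(total_count(10, 12), 1))  # 83.3
```

Notice that total count reliability reduces to the smaller total divided by the larger total, which is exactly the shortcut described next.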
The logic is that the smaller number one observer recorded is, at a minimum, the number of times they agreed, and the difference between the two totals is the disagreements. In this case, observer one said the behavior happened 10 times and observer two said it happened 12 times. They agree on at least 10 of those, and the difference between 10 and 12 is two. So mathematically, if you take 10, their number of agreements, and divide by 10 plus two, agreements plus disagreements, you really just end up with the smaller number divided by the larger number, or 10 divided by 12, and you get 83%. If this were a new definition, we would say that's acceptable. We use total count reliability for event and outcome recording; we use point-by-point reliability when we're using interval and time sample recording. Okay. Now, you cannot use the terms reliability and accuracy interchangeably, because remember, accuracy is the extent to which we have recorded the true value of behavior. We talked in topic four about how some methods of observation are wildly unsuited for the behavior we want to measure. So it could be that your observers were performing perfectly, that they consistently applied the definition, but because we chose a really poor method of observation, there's a huge discrepancy between how much the behavior actually happened, the true value, and the extent to which the observers recorded that it happened. If the observers are recording the same thing as one another, they're recording with reliability. But if they're using a really terrible method of observation, one that we gave them and trained them to use, they could have perfect agreement with one another while their observations are totally inaccurate relative to the true value of behavior. So this is one place where I disagree with Miller.
Miller says that reliability is a measure of accuracy, and that's false. Reliability is the extent to which your definition was consistently applied to the behavior of interest. So I want you to be aware of that distinction. Moving on, let's talk about social validity. When we're talking about changing behavior, we have to bear in mind that we are accountable to the people whose behavior is being changed. So social validity really asks the question: are we choosing the behavior well, and are we changing it well? Miller defines this as the correlation between an expert judge and our behavioral observations. But in plain English, what this means is: did we change the behavior in the right way? Would a person who's really knowledgeable about this say, yes, you definitely hit the goal, you hit the mark, that was exactly what we had in mind? And I'm putting air quotes here around "expert," because that doesn't necessarily have to be someone who has an advanced degree or is highly trained. If you're changing the behavior for a client and the client tells you, that's not really the goal that I had in mind, they are an expert. They're an expert on their own experience. So we want to make sure that there's a high correspondence, a high agreement, between what we changed and what the client, the people around them, stakeholders, or experts say should be changed. Social validity is that correlation between what the expert says should have happened and what we see actually happened in our data. Moving on, this means we want to make sure that we have good alignment, and it really helps make sure that we chose the right behavior and changed it in the right way. Okay, let me give you an example. Imagine we have a problem. We're called in because we have to solve something specific. So we say, okay, we're going to develop a behavioral definition; this is the specific target behavior we're going to work on.
And then we're going to use our behavioral definition from there: this is what we're going to record, and so on. Now, if your question is about how well or how consistently that behavioral definition was applied, you're doing reliability. If you're asking the extent to which we actually solved the problem, you're looking at social validity. Okay, so breaking down social validity, we really want to be asking a few different questions. There are three major aims here. First, with our intervention, did we produce what society wants? For instance, did we make a meaningful change in the behavior that's going to help our client fit better into their environment, into their context? Again, we've got those social and habilitative goals; we want to reduce the amount of aversive control a person receives. So did we choose an intervention, did we choose a goal, that helps the person live a better life? The next question we're going to ask is, was the procedure we chose to change the behavior considered acceptable? For instance, if you give me a car battery, I can change any behavior, but most people would agree that it's really inappropriate to follow people around with some paddles and threaten to shock them if they make any mistakes. Just because you can change behavior using behavioral science doesn't mean you necessarily should. We want to make sure that our procedures are acceptable, especially given the nature of the problem. And then third, we want to make sure that we produced a meaningful effect. If I invest heavily in helping Jeff do better in class, but Jeff's grade goes from a D minus to a D, that's not really helping Jeff very much. But if I were able to take that grade from a D minus to a B, and he's doing much better in our classes, I'm definitely producing a larger, more meaningful effect. Social validity has these three elements.
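Miller's definition of social validity as a correlation can be made concrete. Suppose a stakeholder rates overall progress each week while observers independently record the target behavior; the correspondence between the two series is just a correlation coefficient. This is my own sketch with made-up numbers, not an example from Miller's text.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up weekly data: a stakeholder's 1-7 rating of overall progress
# versus the observers' weekly count of the target behavior.
expert_ratings = [2, 3, 3, 5, 6, 7]
observed_counts = [1, 2, 2, 4, 4, 5]
print(round(pearson_r(expert_ratings, observed_counts), 2))  # 0.98
```

A high correlation like this would suggest that when the data say the behavior improved, the person judging the outcome agrees that things actually got better.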
We want to make sure that we're honoring those three elements. So when we're figuring out whom to ask, there are a lot of people who come to mind really easily. First of all, the client. The client is always a really important person to ask. I do a lot of staff training research, student research, and educational interventions, so of course I'm going to ask people about their experience. You can ask people around the client: family members, loved ones, stakeholders. You can ask experts, people who are super well qualified in that field, to see what they think about it as well. There are a few people that you probably should not ask for social validity ratings, and that would be anyone who knows the behavioral definition. Remember, the whole goal of this is to assess the extent to which you're solving the problem and helping the client out, not the extent to which you applied the behavioral definition. And even if you were to ask a person who knows the behavioral definition for their feedback, you cannot count on them not being biased. Knowing that behavioral definition is going to change what we observe; it's going to change how we look at that person. So even if you think that a person who knows your definition could be impartial, I promise they cannot be. So let's return to that diagram. Remember, reliability focuses on the extent to which the behavioral definition was used by the observers consistently and correctly. But social validity is really the extent to which we're doing meaningful work: we're choosing the right behavior to change, and we're changing it in a way that is acceptable to the client and the people around the client. If your social validity question has anything to do with the behavioral definition, then that's a bad social validity question. That's about the definition, not about the problem. Let me give you a couple of examples.
So say we're talking about Jimmy. Jimmy's boss is really fed up with him. He says that Jimmy's a lazy employee, but it turns out that the real behavior of interest here is that Jimmy's always late for work. So if I'm developing a behavioral definition, I'm probably going to focus on increasing Jimmy's arriving to work on time. But when I go back for social validity, I'm going to go to the boss and say, hey, how good of an employee is Jimmy? I'm not going to say anything about the time that Jimmy arrives. I'm going to have my reliability observers look at what time Jimmy's clocking in. But when I go to my social validity rater, the stakeholder, in this case the boss, what I'm going to ask them about is the goal, which was: this is a crap employee, how can we make that better? So the behavioral definition is about what time they're clocking in, we want them to clock in before a certain time, but my social validity rating is about the quality of the employee. I don't say anything about clocking in and clocking out. If the real problem truly was clocking in late, then fixing the time they clock in should solve all the other problems. For another example, imagine that a young person, the student Yolanda, is really talkative during class time; she interrupts peers while they're working and contributes to some other kids in class being off task. My behavioral definition is probably going to use one of those time-based measurement systems, maybe partial interval recording, to see how often Yolanda is talking during class. Maybe I want to be more advanced and do a per-opportunity measure, but we'll talk about that in Psych 400. Essentially, my behavioral definition is going to evaluate the extent to which Yolanda is talking a lot. Now my social validity rating could be: how good of a student is she, how on task is she, how are things going for her academically?
I'm not going to ask anything about my behavioral definition, which is how much Yolanda is talking. Does that make sense? Okay, moving on. The last thing that I want to leave you with is fidelity. Fidelity is sometimes called treatment integrity. It's the question of whether or not treatment was used as intended. So a question that I start with here is: how can treatment be effective if it was used incorrectly or incompletely? It's kind of like taking an antibiotic. Your doctor tells you that you have to take the whole course of antibiotics; don't stop taking the pills early, even if you feel better. What they're saying is that you have to use this medication with fidelity, because there are unintended consequences if you don't. Treatment integrity, or treatment fidelity, is the extent to which the researcher or the professional correctly implemented the treatment plan. Research here is relatively new, so many experimental journals don't even require researchers to publish data on the fidelity of the intervention, meaning the extent to which the independent variable was applied as designed, although they may ask for data on the dependent variable or the reliability of the observations. But fidelity is super, super important. Bear in mind that if it's worth developing a behavioral definition, it's absolutely worth looking at fidelity data. Reliability focuses on the extent to which we agree on the dependent variable, but you should also be asking: is your independent variable, your treatment, being implemented the way it should be? If not, how can you possibly hope to make any progress with your client? What's worse, if treatment is not implemented with fidelity, you may find that in many cases we move to more intrusive and more invasive methods of treatment.
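The lecture doesn't give a formula for fidelity, but one common way to quantify treatment integrity (my own sketch, not from Miller's text) is a checklist of treatment-plan steps, each scored as implemented correctly or not, reported as a percentage, much like the agreement formula:

```python
def fidelity_percentage(steps_correct):
    """Percent of treatment-plan steps implemented as written.

    `steps_correct` is one boolean per step in the written plan
    (a hypothetical checklist format).
    """
    return 100 * sum(steps_correct) / len(steps_correct)


# Hypothetical session: 8 of 10 plan steps were delivered as written.
checklist = [True, True, True, False, True, True, True, True, False, True]
print(fidelity_percentage(checklist))  # 80.0
```

A score like this tells you how much of the independent variable your client actually received, which is exactly the question fidelity is asking.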
So you really want to get this right at the lowest level of intrusiveness for the client, so that we can meet those goals without having to become more aversive, more controlling of their environment. So in this lecture, we've talked about reliability and how to conduct reliability checks for the different methods of observation. We've talked about social validity and what it's good for, which I promise is really, really important. And finally, fidelity: the extent to which you're actually implementing the treatment as intended. Let me know if you guys have any questions, and I'll see you next time.