I'll jump into it then. So I'm Carl. I work on the user experience design team, on the research sub-team, and I'm going to be talking about benchmarking usability in OpenShift. This is a project that we did a couple of years ago. I'll go over how benchmarking usability works in general first, and then walk through what we actually ended up doing. So first we'll talk about what benchmarking is, which might be a familiar concept. We'll get a nicely specific definition of usability so we can figure out how to actually benchmark it, and why we might want to do this in the first place, because it can be a big undertaking. And then, like I mentioned, we'll go over how we actually implemented this on the user experience design team.

So first of all, what is a benchmark? You can just look this up on Wikipedia: it's any consistent metric or measurement that you can apply over and over again. This could be CPU performance in an application, manufacturing accuracy, or how much it costs to make whatever kind of unit you're making. Benchmarking, then, is just how we get that measurement. So this could be imposing a workload on a system to see how that CPU fares, counting the number of defects that come out of a manufacturing process, or doing a unit cost analysis.

So what is usability? This is obviously a little less tangible, so we need to zoom in on it to understand how we can actually go about measuring it. Broadly speaking, usability is just anybody's quality of experience when interacting with a product or a system. In order to measure it, though, we look for a more defined way of talking about it, so we go to the International Organization for Standardization's definition, ISO 9241-11. This gets referenced a lot in the user experience world when people want to be really precise about how they talk about usability, but it essentially has three main pillars that hold it up. The first is effectiveness: whether a user is able to complete their goal in the software or the system. The second is efficiency: how fast they can complete that goal. And then finally, and a little more subjectively, there's satisfaction: do they feel good about the process they just, hopefully, completed in a good amount of time? So that's how we're defining usability for this benchmarking process.

So now that we know what benchmarking is and we know what usability is, how can we actually benchmark it? We'll go over, again, why we would want to do this in the first place, a few variations and choices you can make if you do a similar process, and how we put it all into action. Simply put, we just want to be sure that we're improving. With design work, it's not always as obvious as with a computing system or even a physical product whether we're moving in the right direction. So this benchmarking process lets us be sure that the design changes we're making are moving the needle the right way: we get quantitative, empirical evidence that we're doing what we want to be doing, and we can see what is changing and how much it is changing. This is pretty much as simple as measuring wherever you're at, making some kind of change to the product or system, measuring again, and comparing those measurements. That's the core of benchmarking usability.
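To make that concrete, here is a minimal sketch of that measure-change-measure-again comparison. The task, the version labels, and the numbers here are invented for illustration, not data from our study:

```python
from statistics import median

# Hypothetical per-participant results for one task ("add a project"),
# one list per benchmarked version: (completed, seconds on task).
results = {
    "3.5":  [(True, 210), (False, 380), (True, 250), (True, 300), (False, 420)],
    "3.11": [(True, 150), (True, 190), (True, 170), (False, 360), (True, 200)],
}

for version, attempts in results.items():
    completion_rate = sum(1 for done, _ in attempts if done) / len(attempts)
    # Median is less thrown off than the mean by one very slow attempt.
    median_time = median(seconds for done, seconds in attempts if done)
    print(f"{version}: {completion_rate:.0%} completed, "
          f"median time on successful attempts: {median_time}s")
```

The specifics don't matter much; the point is that whatever you measure, you measure it the same way both times so the comparison is fair.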
There are some other side pieces to it as well. You can also gather qualitative information that isn't a numerical metric but digs into why something changed. So maybe users add a project to OpenShift twice as fast; qualitative information lets us know why that's changing, what's driving those quantitative metrics. What we did was measure a version against an earlier version. You can also compare to a competitor product if it occupies a very similar space and set of user goals, or you can compare to an industry standard if you only have one version available for testing.

So there are a few types of usability benchmarking: behavioral and retrospective. These are the two big buckets we can categorize them into. With behavioral, we're actually watching participants attempt a goal in the software while they're being recorded, and we observe their behaviors and anything that happens while they're working with it. So again, this is focused on behavior, and we're measuring the user's actions as they move through the actual software. This is different from retrospective, where we have participants recall their recent experience with the product. That's more focused on their memory of the product, their attitude, their feeling towards the software, so it's often done through questionnaires or just verbally asking the person. These can also be combined; you don't have to choose one or the other. And I would argue that they should be combined, because people's feelings and their behaviors often come apart. I've seen people miserably fail a task that was very hard to do, that they shouldn't have been able to complete, and then rate it five out of five on usability, whether they want to please the researcher who's helping them out or they just don't want to look like they've failed. If you're going to capture that retrospective information, it's important to also see what they're actually doing, because the two are often not the same thing.

The level of usability benchmarking is how zoomed in you are on the measurement. We have task level and product level, which are fairly self-explanatory. At the task level, we're getting a benchmark for each task. For OpenShift, that might be adding a project, changing a configuration file, adding a new user group, whatever the individual task might be. So we have to figure out what users actually do here; we need to figure out the top tasks. And when I say top tasks: you can do a million things in the OpenShift interface, so you have to figure out which ones you want to test, because you obviously can't test every single thing someone could possibly do. This is generally done behaviorally. You could just ask people, again with a questionnaire, how they felt about each task, but if you're going to have them do the task anyway, you might as well take those behavioral metrics. At the product level, we're looking at the product overall. Say we're just asking how people feel about OpenShift in general, not about a specific task within it. This is most often done with a retrospective questionnaire or email survey. You might have seen something like the Net Promoter Score, which is a different thing than benchmarking, but essentially you can email people and just ask how easy the product was to use. That gives you a very high-level benchmark, but it is still a benchmark that you can use.
And again, these can be combined. Especially if you have people going through the tasks, you can also ask them how they feel about the product overall, and that gives you a little more information to triangulate.

Then there are the actual metrics, which are most often split into behavioral and retrospective. The gold-standard behavioral metric in usability is completion rate. If the user can't complete their goal, nothing else really matters: it doesn't matter how long it takes them if they're not completing it, and if they're satisfied but haven't completed it, something's clearly not connecting. So this is a really common one to measure. It's rough, but it's a very good metric to capture. A little more fine-grained is error rate. You have to be able to decide, for every step, whether it's an error or not, so this takes a bit more work, but it gives more insight into where the process might be breaking down as they try to complete their goal. And then time on task is how quickly they, hopefully, complete their task.

For retrospective, these are pretty much questionnaires, and a lot of them are product level, like the System Usability Scale, or the SUS as it's often referred to. It's 10 items, and it gets at essentially how learnable your system is and how usable it is. You have people fill it out after they've completed a bunch of tasks when you want to know how they feel about the product in general, or again, you could just email it out. There's also a newer scale called the UMUX-Lite, which is just two items and captures essentially the same thing as the System Usability Scale. So if you only have time for two questions, or you don't expect people to want to sit through 10 questions, we have some shorter options now. And then the shortest of all is the single ease question: how easy was this to complete, or how easy is this product to use? You can use that on individual tasks or for a product as a whole. The key here is to choose something that already exists. It's tempting to write your own scale, but the wording on these scales is very hard to get right, even though it might seem intuitive to write. If you use something that already exists, you can be sure that people have validated it, and you can compare against a sort of industry standard. If you get a four out of five on a scale you've just made up, you don't really know what that means. If you use the System Usability Scale, you can know exactly how that stacks up against other industries and other products out there that have measured with it.
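As an aside, scoring the SUS is mechanical enough that a small sketch may help. The scoring rules below are the standard published ones; the example responses are invented:

```python
def sus_score(responses):
    """Score one participant's SUS questionnaire.

    responses: list of 10 answers, each 1-5, in item order (item 1 first).
    Returns a score on the standard 0-100 scale.
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten answers, each between 1 and 5")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd items are positively worded (score - 1),
        # even items are negatively worded (5 - score).
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Invented example: one fairly positive participant.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # 85.0
```

That 0-to-100 score is exactly what lets you compare against published SUS benchmarks from other products and industries.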
And then the final piece before we get into more specifics is moderated versus unmoderated. For moderated testing, researchers are actively involved in the process: they're guiding users through, fielding questions, and prompting for more qualitative "why" information about what the participant is thinking, and you probably have a note taker as well. This can be done in a lot of places. When I say a lab setting, it could be a room with a big one-way mirror that people are watching through from the other side, or it could just be a conference room. It could be ethnographic, in the participant's work location, although that's a little harder to swing, but it can be really interesting if you can do it. With OpenShift, we did it all remotely; with software products especially, that's easy enough to do. Moderated testing is a lot more expensive, there are a lot more people involved, and it takes a lot more time, but you get a ton more data. So that's the trade-off: you get all that qualitative information, but it does cost a lot more.

That's as opposed to unmoderated, where participants complete the tasks on their own time and there's no researcher involved in the moment. The location could be anywhere, and you don't have control over that, so if location matters for your product, unmoderated testing may not be a great option. It's far less expensive than moderated testing, but it does come with more concerns. Beyond not knowing where people are doing the testing, people also tend to take longer: if someone's adding a project in OpenShift and they get really thirsty, get a glass of water, and come back, that's going to be in the data. It's obviously not part of adding a project in OpenShift, but it will still be in there, so there's more noise to work around. You also need to be sure that your directions are extremely clear, because you can't guide people back if they get off track, like you could if you were there leading them through the whole process.

So those are some high-level features you can choose from if you're going through a benchmarking process. Then you have to put it all together into a plan, and I'll go through how that actually comes together and what we did along the way. You'll see, mostly in the red text, what we did at each point. For a little background, our team wanted an empirical measurement of our design impact on OpenShift: how much of a change had we created in the product? Hopefully that change was good; that's what we set out to prove. But if it wasn't, we would have figured that out too, which would also have been useful information. We started with version 3.5, which was before a major UXD effort took place, and then later we followed up with version 3.11, after we had implemented some major user experience changes in the design, so we could compare usability between the versions. Essentially, the research question, or hypothesis, is that if we can demonstrate higher usability on 3.11, that would be good evidence for a positive impact of our efforts on the product's usability. You're going to see that at the top of the slides, and each phase will turn red as we go through the process.

Now, participants. Most people don't really think about this aspect; after doing it once, it's about all I think about now. This is the hardest part of the process, and it takes the longest, so start it as early as you can. You have to figure out first who the right users are, and I'll give you one hint: it's not you. You're not the user. You might see that on my laptop sticker there. Even though it's tempting to pick yourself or someone on your team, because they're right there, anyone who develops the product is too close to it to really sit in the seat of the user. They know too well how it works and exactly how it's architected together. So you want to find an actual set of end users. It could be someone in your company, if they don't develop the product at all, but even then, their understanding is probably going to be a little different from an actual end user's. For us, OpenShift has two broad personas, developers and admins, or sysadmins. You can dig in a little more there, but we essentially wanted a spread of external developers and admins to run through all of the tasks that we put together for this testing.
And then this is the part that really takes a long time: finding those users. If it's something consumer focused, you can kind of pick people off the street; it's not that hard. If you have a more expert set of users, like OpenShift has, it can take a while, and you have to get creative with how you actually find those people. It can be through one of the many websites that exist now, like userinterviews.com, where they set up a panel and scan through LinkedIn to find people. But if you think people are going to respond to a Twitter hashtag, there's nothing wrong with that; if it gets the people to you, that's totally fine. Sometimes you do have to get creative in how you recruit for this kind of project. So we chose just about every avenue that we could.