All right. Hello, everybody. Thank you for showing up today. It's Monica Wahee. Why are you here? You're here to see ways to extend the meaning of your data, and to learn about crosswalks for data classification. As you know, I'm Monica Wahee, data scientist, epidemiologist, and biostatistician. What we're going to talk about today is what I even mean by crosswalks.

I want you to start by thinking about categorical variables. If you do statistics, you might think of a chi-square analysis. That's a nice bivariate analysis of categorical variables. Hi, Erica. I'm so glad you showed up. So we're talking about crosswalks and we're talking about categorical variables. I remember one day when I was working in Florida at an Alzheimer's institute, I was using PC SAS, and I ran a chi-square between age groups and income levels. There were about four age groups and about four income levels, and I crashed my PC SAS, because even four was too many. But a lot of the time you get data with many more levels than that. I often would get health insurance data, with insurers like Aetna and Cigna and so on, and you want to classify it into a smaller number of categories. That's basically what I'm talking about with crosswalks.

I also want to introduce the term cardinality. If you're from an informatics background, you'll know this term, but we don't really learn it in public health. Cardinality means how many classifications, or levels, a variable has. A variable like systolic blood pressure, where a value might be 110 mmHg, has high cardinality. That's basically a continuous variable.
A variable like gender, no matter how you categorize it, is going to be low cardinality. You're not going to have that many categories. But when you think about health insurance, or state in the United States, that could be up to n = 50 categories. That's a lot of categories. And what happens when you have too many categories? You can't really do a lot with your data. That's what making crosswalks is about.

If you program, you've probably programmed a recoded variable before. Think of age groups. People are always recoding them: if age is greater than or equal to 18 and less than 25, make it a one, and you know what I mean.

I use crosswalks all the time in data warehousing. One situation you'll have is that there are only so many hospitals in a region, meaning the physical buildings. You'll have a list of these physical buildings, but they change names over time. Trust me, they're the same physical building. So we'll have many different identifiers for that hospital: Medicare might have an identifier, and there might be a local state identifier. That's where I really use crosswalks. But if you join my data science online mentoring program and do a portfolio project, you might just be doing a crosswalk to reduce the number of levels, or the cardinality, of an important categorical variable.

I'm going to show you something that I do, and you're going to think, okay, I've done that before, that's basically recoding groups. But what I'm going to show you is how you sit down and design the new crosswalk variable: how you sit and think about it and design it. Then you can go program it.
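
To make the two ideas above concrete, here is a minimal Python sketch of counting cardinality and of the hand-programmed age-group recode. It is only an illustration: the talk gives just the 18-to-24 group (code 1), so the other cut points, the codes 0, 2, and 3, the sample ages, and the function names `cardinality` and `recode_age_group` are all assumptions.

```python
def cardinality(values):
    """Return the number of distinct levels found in a list of values."""
    return len(set(values))


def recode_age_group(age):
    """Hand-coded recode of age in years to an age-group code.

    Only the 18-to-24 group (code 1) comes from the talk; the other
    cut points and codes are illustrative assumptions.
    """
    if age < 18:
        return 0
    elif age < 25:
        return 1
    elif age < 65:
        return 2
    return 3


# Made-up ages, used only to exercise the two functions.
ages = [17, 18, 24, 25, 40, 64, 65, 80]
age_groups = [recode_age_group(a) for a in ages]

print(cardinality(ages))        # number of distinct raw ages
print(cardinality(age_groups))  # number of distinct age-group codes
```

The recode logic lives inside the program here, which is exactly what a crosswalk table moves out into a document you can read and revise.
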
And I'll tell you about a trick, which is that if you design it and put it in Excel, like I'm going to show you, you can do some automatic stuff with coding your data. Let me go through the steps I did.

This is how you make a crosswalk, literally. A crosswalk is just a table of documentation to guide you when you write code. I use Excel to make this. One column holds the old value, and the other column holds the new value. So you're just designing a new variable. The trick is that in the new column there are repeats, and that shows you how you are reducing the number of levels in the new variable.

By designing the crosswalk before programming it, you get to make intelligent decisions using your brain. A human brain, not artificial intelligence, not some algorithm. Because guess what, I keep encountering this: humans are smarter than machines. Except in certain specific domains, like playing chess, where you can make a machine better, humans are generally better than the machines. So design decisions should be based on what you think will help reveal knowledge pertinent to your research question or aim. I'm literally going to show you me doing that.

You can make more than one set of classifications. Imagine the United States, with its 50 states. You can make a variable that classifies whether a state is north or south of the Mason-Dixon line, so it would be north or south. But you could make another crosswalk that classifies whether the state is east or west of the Mississippi River. It's the same state. You can make as many crosswalk variables as you want. They're free. You can also add keys, like primary keys and foreign keys, to help with data processing.
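
The old-value and new-value table described above can be sketched in Python as follows. This is an illustration under assumptions, not the speaker's own file: the five states, their rough north/south and east/west assignments, the sample `records`, and the names `CROSSWALK`, `mason_dixon`, and `mississippi` are all made up for the example.

```python
# Each row pairs one old value (a state) with new values from two
# separate crosswalk designs. The states chosen and their rough
# assignments are illustrative assumptions.
CROSSWALK = [
    # (old value,      north/south,  east/west)
    ("Massachusetts", "North", "East"),
    ("Florida",       "South", "East"),
    ("Georgia",       "South", "East"),
    ("Iowa",          "North", "West"),
    ("Texas",         "South", "West"),
]

# Turn each new-value column into a lookup keyed on the old value.
mason_dixon = {state: ns for state, ns, _ in CROSSWALK}
mississippi = {state: ew for state, _, ew in CROSSWALK}

# Made-up records holding the original high-cardinality variable.
records = ["Texas", "Florida", "Iowa", "Florida", "Massachusetts"]

# Apply both crosswalks to the same records.
north_south = [mason_dixon[s] for s in records]
east_west = [mississippi[s] for s in records]

print(len(set(records)), "levels before the crosswalk")
print(len(set(north_south)), "levels after the north/south crosswalk")
print(len(set(east_west)), "levels after the east/west crosswalk")
```

The repeats in the new-value columns are what reduce the number of levels, and the same old value can carry as many new classifications as the research question calls for.
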
For example, if you had a list of cities in a state like Massachusetts, you could create a county crosswalk variable, and that can be a foreign key to some other county data. You can also base your classifications on empirical data, like frequencies of occurrence. You can say, okay, I crosswalked this down to four categories, because there were three big categories and a whole bunch of others. You can do it based on that. But I'll show you an example and then you'll know what I'm talking about. One of the links goes to it.

This is an example of a portfolio project, and it uses adverse event data. Drugs are put on the market and people start taking them. Sometimes they get sick and they think it's because of the drug, so they report it into an adverse event database called FAERS. If you go here, you can learn how to query FAERS and download data from it. One of the people in the data science mentoring program right now is looking at Ozempic, a drug that is supposed to be used to control diabetes, but a lot of celebrities are using it to lose weight. Now, I didn't realize this, but Ozempic is an injectable. It's not taken orally. So I started getting worried. Oh my gosh, here are these people who are not diabetic, they're just taking it to lose weight. Aren't they getting a lot of adverse events?

So I went and downloaded the data, which is demonstrated here. This is what the data looked like. I'll make it a little bigger so you can see it better, and go to the top. It gave me this category and the number of cases. Did I calculate the percentage? No, it gave me the percentage. And I added this order column, just in Excel, because I wanted to see how many there were.
So let's go look at how many there are. Oh gosh, there's an awful lot, right? It's about 26,000. But look at these adverse events down here. They're kind of weird. These are the rare ones, since this is sorted in order of frequency. You have ex-alcohol user. Is that even an adverse event? You have sexual abstinence, menstrual clots. Are these even real?

Of course, I was more interested in the more common ones. Nausea and vomiting are really common with so many drugs. This is usually the top thing you see in adverse event databases: nausea, vomiting, diarrhea. But the fourth one was off-label use. Off-label use means using a drug for something the FDA hasn't approved it for, and that's not that weird in general. I have migraines, and if we have trouble controlling them, my doctor might prescribe something off label that's used for, I don't know, maybe seizures or depression, because the physician thinks it'll work. So off-label use is not illegal. But first of all, that's not an adverse event. And why is it number four?

I don't know exactly what it means. There's no problem with off-label use if it's supervised by a physician. So what I was assuming this means is off-label use that caused a problem, where some issue was related to the off-label use. That's not weird exactly, but it's unusual. It's not the same as those first ones, which are the kind of normal things you'd expect. The next ones are decreased appetite, weight decreased, blood glucose increased, constipation. They seem like the kind of normal things you'd see with any drug.
But then, past headache, I got down here and it says product use in unapproved indication. And that was only number 15. So I thought, you know, I could take these 26,000 rows and classify them by whether or not there's a use error. I could just do that. So I created this new crosswalk variable.

I just wanted to do this as a demonstration to show you how you might do a portfolio project on this. I was not doing this for a peer-reviewed article. If I were, I would want to go through and classify every single row, all 26,000, and I would want to keep track of the rules I used when I did that. But because I was just doing a demonstration, I decided I would only read and classify the rows down to a frequency of 40 cases, like down here. So you see, at 40 cases I just stopped. I just got lazy.

This is how I kept track of my data dictionary, my coding. For the use error variable, code 1 is adverse event appeared linked to a use error. Code 2 is adverse event does not appear linked to a use error. And code 3 is I'm too lazy to categorize. If I really wanted to go back and do evidence-based medicine, I could categorize all of them, and then I could get rid of 3 and really look at it. But let me show you how far I got.

What I did was read each one of these. Remember, 1 is a use error and 2 is not. So here is product use in unapproved indication, and wrong technique in product usage process. And here is inappropriate schedule of product administration. So you see my inkling: people are overusing it, and they have to administer it by injection, and that is causing all these use errors. And I was thinking, yeah, this is looking pretty much like I was right.
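
The data dictionary and the 40-case cutoff described above could be written down as a small Python sketch like this one. It is a hedged illustration: the three codes and the cutoff come from the talk, but the function name `use_error_code`, the `reviewed` lookup, the case counts in the usage lines, and the code 2 given to nausea are assumptions.

```python
# Data dictionary for the use-error crosswalk variable.
USE_ERROR_CODES = {
    1: "Adverse event appeared linked to a use error",
    2: "Adverse event does not appear linked to a use error",
    3: "Not classified (below the review cutoff)",
}

# Categories with fewer cases than this were not read or classified.
REVIEW_CUTOFF = 40


def use_error_code(category, cases, reviewed):
    """Return the crosswalk code for one category row.

    `reviewed` maps a category to the code a person assigned after
    reading it. Rows under the cutoff, or never reviewed, get code 3.
    """
    if cases < REVIEW_CUTOFF:
        return 3
    return reviewed.get(category, 3)


# Codes a reviewer might record. The code for "Nausea" is an assumption.
reviewed = {
    "Product use in unapproved indication": 1,
    "Wrong technique in product usage process": 1,
    "Inappropriate schedule of product administration": 1,
    "Nausea": 2,
}

# The case counts below are hypothetical.
code = use_error_code("Wrong technique in product usage process", 150, reviewed)
print(code, USE_ERROR_CODES[code])
code = use_error_code("Menstrual clots", 1, reviewed)
print(code, USE_ERROR_CODES[code])
```

Keeping the human decisions in `reviewed` and the cutoff rule in one place makes it easy to extend the review later and retire code 3.
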
And you can see incorrect dose administered, product communication issue. So you see all these ones and twos. I just went through here by myself. Now, you might ask: you're coding this all by yourself, so how do you know somebody else would agree with you? Well, you can have more than one person look at this and keep going until you're done. In fact, if you're on a research team, it's probably a lot easier to get 26,000 rows classified.

Remember, our original data set is these two variables. So theoretically, once we've created this variable, I could save this as a CSV, import it into SAS or R, join or merge it on this category column, and just patch on this new variable. I think I did something like that in R on this page. Yeah, I did something in R here. So that makes it not that hard. It's not like coding age groups, where greater than or equal to 18 and less than 25 equals 1. You wouldn't have to do that. You could just import this, do the join, and you've got your category.

Over here, I just want to show you what I found. I was lazy and didn't classify everything. The violet-colored piece is the unclassified ones, the ones I gave up on, and that's 23%. That's about a quarter of this data set, so imagine this is unknown right now. What is known is that 68% are for sure not use errors, and 10% are for sure use errors. So this piece is either going to stay at 10% or pick up some of the unclassified piece. That's a lot. If between 10 and 20% of the reports in this FAERS database about Ozempic were use errors, I think I'm the first person to figure that out. And what's great is that if people started arguing with me, they can just look at my crosswalk and figure out what I did.
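
The save-as-CSV and join step described above was done in R in the talk. The sketch below shows the same idea in Python with only the standard library, so it is not the speaker's code. The two CSV texts, every case count, the codes given to each category, and the column name `use_error` are hypothetical stand-ins.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the downloaded data (category and cases).
DATA_CSV = """category,cases
Nausea,500
Vomiting,300
Incorrect dose administered,120
Product use in unapproved indication,80
Menstrual clots,2
Sexual abstinence,1
"""

# Hypothetical stand-in for the crosswalk saved from Excel as CSV.
CROSSWALK_CSV = """category,use_error
Nausea,2
Vomiting,2
Incorrect dose administered,1
Product use in unapproved indication,1
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Build a lookup from the crosswalk: category -> use-error code.
crosswalk = {r["category"]: int(r["use_error"]) for r in read_rows(CROSSWALK_CSV)}

# Left join on the category column; unmatched categories get code 3.
cases_by_code = defaultdict(int)
for row in read_rows(DATA_CSV):
    code = crosswalk.get(row["category"], 3)
    cases_by_code[code] += int(row["cases"])

total = sum(cases_by_code.values())
shares = {c: round(100 * n / total, 1) for c, n in sorted(cases_by_code.items())}
print(shares)

# The true use-error share lies between the code 1 share alone and
# the code 1 share plus everything still unclassified (code 3).
lower = 100 * cases_by_code[1] / total
upper = 100 * (cases_by_code[1] + cases_by_code[3]) / total
print(round(lower, 1), round(upper, 1))
```

Because the classification lives in the crosswalk file rather than in the program, anyone who disagrees with a code can change one row and rerun the join.
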
One point I know somebody's going to make is, why don't you classify them automatically? But how would you do that in this case? You literally have to read each one. You literally have to use human intelligence.

So I'm going to give you an example of why you would think of crosswalk variables, and why you would make them in Excel. The goal is to keep documentation about why you made the classification decisions you did. You could use this map here to classify the 50 states into just a few regions or divisions, like West, Midwest, whatever. But let's say you're comparing states based on something else, like marijuana laws, or car accident laws, or building codes, or demographic populations. You could take these 50 states, which is only 50 rows, and go through and use your own classification system. A lot of people never think about this. They go, oh, you can just do that?

And if you think about it, if you ever ask a question on a survey that has "other" as an answer, people start putting stuff in "other". Eventually, when you analyze that, if they say the same thing in "other" more than once, you're going to want to turn that into a level of your categorical variable. This is just an extended version of doing that.

Now, I was just talking about FAERS, the adverse event online dashboard where I downloaded the data. I was talking about using Excel to classify the data, and about using R and SAS to analyze the data. All of these things are applications. And to be honest, what we focused on today was taking data downloaded from this FAERS application and trying to figure out, what could I do with these data? That's not normally what happens in research.
If you design a research protocol, and I'm an epidemiologist, so we design data collection, it goes differently. If I wanted to study adverse events in people taking Ozempic, I'd design a really good study. I wouldn't have these crazy classifications like sexual abstinence and alcohol use or whatever they were saying. And so that's the problem: if you're expected to analyze data from an application, like this dashboard, you're going to have to figure it out.

So I'm holding a workshop called Application Basics, and our theme this month is the promise of open source. The learning objective is to understand data sets from applications well enough to analyze them and produce results. Today I taught you one of my tricks. I have many of them. If you come to the workshop, you'll learn about computer applications. You'll learn design approaches for computer applications like that dashboard, the team structure, how data are stored in the application, and the terminology used in application development. With this knowledge, you can break through communication barriers to get the answers you need to complete your analysis and be seen as an expert.

Here are the details of the workshop. Again, it's called Application Basics: The Promise of Open Source. I'm holding it Saturday and Sunday, February 24 and 25. Each session starts at 12 p.m. Eastern time and runs about three hours, so that's three hours one day and three hours the next. The normal price for a workshop like this is about $250 to $750, because I was pricing some weekend workshops. But because you came to this event today, it's free for you. I would love to have you there.

Great. Well, thank you so much for coming to our event today. Please make sure you sign up for the workshop if you're interested. I hope you have a good week, and I hope to see you at my next events or at the workshop.