This video is part of the Public Health to Data Science rebrand program. All right, hello and welcome, everybody, to today's workshop. Let me share my screen so we can see what we're talking about, which is how to find data for portfolio projects. I'm going to talk about strategies, then demonstrate them, and then we're going to take time to try those strategies and see what we found. Of course, this presentation changes every minute, because everything has changed since I first started trying to find data on the internet. Back then the internet had just been invented. At first there weren't any search engines, and once they invented search engines, you still couldn't really search a database. You know how you log on to your bank and look up how your account is doing? You couldn't do that. And if you couldn't do that, how were you going to get data? I knew back in the 90s that you could request data tapes from the CDC with study data on them, and I knew you could request data from Medicaid, so one of the first things I started looking for was a place to download it online. Fast forward, and things have really changed. I've now boiled it down to three general strategies, and that's what I'm going to talk about today. So the focus of today's workshop is how to find data on the internet that you can use for portfolio projects, meaning you can take the data, do a project with it, and post it on WordPress or something, and it'll look good for you, it won't get you in trouble, and it won't be a bad idea.
But it's not really that easy to do, so I'm going to show you my strategies. One thing you want to do is budget a lot of time, because you're going to go down some rabbit holes. Even with well-documented data it's hard to shop for it, so you'll see it's very time consuming. Also, sometimes you don't find the data you want, but you find raw information that you could turn into data, so I'm going to explain what you can do in that case. And what's really important, if you're planning for your master's thesis or dissertation or a paper, so it's a long-term plan that's not going to go away, is to keep track of what you find and choose not to use. You're going to keep finding it again if you don't write down what it is and why you didn't use it. Also, the same data set can be hosted on multiple platforms, so when you're searching in a different place and you find a data set, you want to make sure it's not the same one you found before and didn't want. All right, so let's see here. The first thing you want to do for your portfolio project is decide roughly what kind of data you want to use. Mika's here, and we had talked about using what we're calling pharmacy claims data. I'll get a little into why I'm saying pharmacy claims. As for Beth, she'll probably watch this, and I'm going to try to look for her today if I have time: Beth is analyzing concentrations of some viruses in wastewater data. Now, if you think about wastewater, wastewater is not a person. It doesn't have PHI. What Beth is challenged with is that wastewater data has a lot of continuous variables in it.
In epidemiology we don't always use continuous variables, and wastewater data also has all these correlated variables. It's almost the same as if you take a whole bunch of people's labs from a complete metabolic panel. All their labs are going to be correlated: sick people are going to have bad labs and healthy people are going to have good labs. So contaminated wastewater is going to have high concentrations of all different viruses, and not-so-contaminated wastewater is going to have less. She's working on that at her job, so I was thinking, if I found a data set for Beth, I'd want one with a bunch of continuous variables that are probably correlated. I'd really want it to be about wastewater, but if it wasn't, it could be kind of similar, with these continuous variables, so we can practice that. That's how you want to think about portfolio projects. Or let's say we don't find wastewater data like hers, but we find data about wastewater, not exactly the same data but something that would inform whatever she's doing. Then maybe she should shift her portfolio project to that data set, but we'll just see. So what I just talked about with Beth was her scenario, and I talked a little about data set requirements. I also made a scenario on the slide. This is sort of a half-true scenario: I have a customer who's interested in this topic, so I decided to make a scenario around it. Imagine that Abe is a nurse in the US, and he is concerned about his patients with type 2 diabetes. He's concerned that those with severe cases are not able to access the injectable insulin they need due to shortages in the US. So imagine Abe is a nurse who's interested in informatics, and he wants to do a project around this topic.
He's probably imagining getting encounter data, or data about individuals and insulin and type 2 diabetes, but the reality is that it might not be very easy for him to find that. So maybe he can adjust his question or do something slightly different, so he can still do a project on the topic with a different data set. So I wrote these data set requirements for Abe. The reason I have these here is that when you go shopping for data, you end up really going down the rabbit hole, because you have to read a lot about a data set to decide if it's what you really want, and sometimes you're not sure. It's nice to be able to look back at the requirements you wrote down. You can change your requirements, just like you can change your project, if you find data you want or you figure something out from the experience. But in the beginning it's good to set up these data set requirements, so that as you go down the rabbit hole, you can look back at them and see what you were thinking. First, I wrote down that it must pertain to type 2 diabetics in the US who use insulin. That's who he's worried about. Second, it must provide insight into how insulin shortages may affect these individuals. We'd all love it if there was a question on a survey about that, but what we would probably have to do is a cross-sectional analysis: if you were able to find the data, do the people who say that they need insulin have other attributes that suggest a shortage is affecting them? You'd have to use a proxy. So that's the second requirement. The third is the more recent the better, but that's not so much a requirement. And you'll find that recency of available data can be an issue. All right.
So, when you're writing requirements for a data set for your portfolio project, like those three, you don't really need a lot of them. But it makes sense to do it, because you'll find a lot of data that almost meets your requirements but not quite, and then you want to ask yourself, do I change what I'm doing? It's not like you're in a class with homework where you have to answer these questions, and it's also not like you're at work. If you're just doing a portfolio project at home, you can do something else so you can get a project done and sink your teeth into the topic. So sometimes you have to adjust what you're going to do just to be expeditious and get going on it. An example: let's say you think you want encounter or claims data. That's data from clinical settings, and sometimes you can get it, because some de-identified data sets exist. But a lot of them don't, or if they exist, what's in them, and what part of the encounter do you really need? Hospital discharge data is a really good example. We were talking about HCUP, which has a hospital discharge data set. But there's something a lot of people don't do before they approach that hospital discharge data set, and I know this because I counsel them. They'll say, I want it, and I say, okay, I need you to sit down and just have a fantasy with me. All of the people in this data set actually got admitted to the hospital. So I want you to sit and think about that: all these people got admitted at different times in the past. And sometimes they start going, oh, wait a second, you're right, all of these people were in the hospital for some period of time, and this is their discharge data.
What they realize is that they might want to know certain things that happened along the way during the hospital stay, and maybe that data is not exactly there. Maybe it's more like what things look like when the patient is ready to leave. So sometimes when you sit down and think about what you really need for the project you were thinking of doing, what exact variables, maybe they're not even in the data set you were thinking of. Mika was going to look at pharmacy claims data, and one thing I had done before, Mika, is go out on the internet to try to find you some. I found this thing from New Hampshire: the State of New Hampshire had this guide posted for Medicare Part D Prescription Drug Event data, which is like pharmacy encounter data. So why did I go steal New Hampshire's? It's kind of old, and it's not very long, but I thought I could see a data dictionary in there, and I'll show it to you. It's kind of like a data dictionary. I don't work with pharmacy claims data, so I didn't know exactly what was in there, but it kind of makes sense. I'm going to go through some of these elements because they're in basically all encounter data. You're always going to have a beneficiary ID. Why? The word beneficiary means the person has insurance, so it's like an insurance ID. Then here's a plan ID. Okay, now this prescriber ID. Let's say that we're talking about New Hampshire. Not everybody can prescribe drugs in New Hampshire, not even every clinician. Only certain people are approved to do that through their licensing by the State of New Hampshire, and when they're licensed they are assigned an ID. Now, I don't know about this particular data set.
If you download that and you get a prescriber ID, it might not be a real prescriber ID. If this is a data set on the internet, it might be a number that you cannot look up, so you cannot figure out who that person is. But when data sets are prepared for analysis on the internet, which I'll be showing you in a second, what they'll sometimes do is make it so you can find out something about this prescriber ID. For example, is this a cardiologist, is this a pain management person, that kind of stuff. I'm pretty sure that if they give out any data like this for analysis, they make it so you can't tell who these people are. But it's important to have these IDs so you can link things up, like being able to link the same beneficiary's claims. Then there's the pharmacy ID, which is where the drug was dispensed. Again, the pharmacy ID might not be redacted; I'm not sure, and it depends on the state or who's preparing the data set. But it might also be like the prescriber ID, in that it's redacted but crosswalks to some categories like retail pharmacy, hospital pharmacy, or whatever. And again, I'm just looking at this; I'm sure there are pick lists and things I'm not looking at. The reason I went over these IDs is that I think you'd need them for any analysis of pharmacy claims. Let's say I was asked, what drugs do cardiologists prescribe the most in New Hampshire, and I was using this data set. I'd probably need all of these. I'd want to know, for a given cardiologist and a given beneficiary, how many prescriptions they were filling per person. So I usually want IDs. And after that, do I really want everything here, like the date of service?
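To make that concrete, here is a minimal sketch in Python of the cardiologist question. It is not the New Hampshire file layout: the field names (`beneficiary_id`, `prescriber_id`, `drug_name`), the specialty crosswalk, and the toy rows are all assumptions made up for illustration. The point is only that the IDs are what let you link a claim to a prescriber category and count fills per person.

```python
from collections import Counter

# Toy claims rows. Field names and values are hypothetical,
# not the actual Prescription Drug Event layout.
claims = [
    {"beneficiary_id": "B1", "prescriber_id": "P1", "drug_name": "atorvastatin"},
    {"beneficiary_id": "B1", "prescriber_id": "P1", "drug_name": "lisinopril"},
    {"beneficiary_id": "B2", "prescriber_id": "P1", "drug_name": "atorvastatin"},
    {"beneficiary_id": "B3", "prescriber_id": "P2", "drug_name": "gabapentin"},
    {"beneficiary_id": "B3", "prescriber_id": "P1", "drug_name": "atorvastatin"},
]

# Hypothetical crosswalk: a redacted prescriber ID maps to a category only,
# so you can tell the specialty without knowing who the person is.
specialty_by_prescriber = {"P1": "cardiology", "P2": "pain management"}

# Link each claim to a specialty through the prescriber ID.
cardiology_claims = [
    claim for claim in claims
    if specialty_by_prescriber.get(claim["prescriber_id"]) == "cardiology"
]

# Which drugs do cardiologists prescribe most, and how many fills per person?
top_drugs = Counter(claim["drug_name"] for claim in cardiology_claims)
fills_per_beneficiary = Counter(claim["beneficiary_id"] for claim in cardiology_claims)

print(top_drugs.most_common(3))
print(dict(fills_per_beneficiary))
```

Notice that cost, payment, and coverage fields never show up in this sketch. For this particular question, three columns carry all the weight.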
The thing Mika and I were talking about doing is called topic modeling, which is where you take a whole corpus of data and sort it into categories based on what co-occurs. It's a language model. It had been applied in this article using, basically, the drug name. I think here it would be in this drug information: product service identifier, drug coverage status code, quantity dispensed, days supply. And it looks like it's this huge long string. That's the way I see it here. So I would imagine that the topic modeling is based on this variable, and you can break in, Mika, if you know more about data like this than I do. Then there's other stuff, like drug costs. But if I'm talking about topic modeling, we don't care what it costs, we don't really care about dates, and we don't care about payment. The catastrophic coverage indicator, I'm pretty sure we don't care about that either. So you start to see that we don't care about most of this data dictionary if we want to do topic modeling of drugs. That already suggests, well, maybe I can just find a list of the drugs that were being prescribed, or maybe I can find some similar data that isn't the whole claim. That's what I was saying: you want to sit down with your question or what you're trying to do. For instance, let's say I wanted to make an UpSet plot; a lot of you have seen my blog about that. When you have an UpSet plot, you have these entities that have a lot of patterns. Like, this person: do they have diabetes, do they have cardiovascular disease, do they have arthritis? What are those patterns? And let's say you had a clinical data set.
Well, you could reduce that data set to a whole bunch of ones and zeros, and you could just make up a study ID. You could sample 100 people and put in ones and zeros, and suddenly it's not even clinical data anymore, really. It's not HIPAA data; it's not anything. And then you can visualize it. That's what I'm saying: sometimes in data science, if you're trying to practice a package or something, maybe you don't even need the whole data set. Obviously, if you're trying to answer a question about real live data, like you work on a production database and you're trying to improve the I/O, this won't work. But if you're just trying to get used to a package or a certain process, then you can kind of cheat: focus on the part of the data set you need for that exercise, try to find that, and just use that. All right. So, any questions before I move on, Mika? Well, let me just ask you, have you looked for any data or not? I did. You did? Yeah. Actually, I found that CMS has drug data available for public use. Okay, well, I was going to go to CMS anyway. So when I get to the CMS one, I'll call on you, we'll go look at what you did, and you can tell me what to do. All right. Wonderful. Good job. Okay, so today it's 2023, and these are my three main approaches. They aren't mutually exclusive; they're more like a gradation. One is shop for data, and by that I mean look in official data repositories. We were just mentioning CMS, and I would classify that as an official data repository. I'll say more about what counts as an official data repository versus just data on the internet, because two is more like just data on the internet. So, two is hustle for data.
That's looking for data sets available on the internet that are not part of a repository and may require getting permission. With hustle for data, you don't necessarily find the data and just download it. If you find data in a repository, usually you can just download it, or if you need permission, you fill out a form, click something, and then download it. You don't have to apply and get approved or anything. But under hustle for data, I put cases where you're really finding a unique data set that's hosted by a particular society or organization, and they make it available but you've got to get approved or whatever. And then, finally, the last one is make your own data. I don't mean do a survey, or go through the IRB and do data collection like we do in epidemiology. What I mean by make your own data is that you collect data off of the internet, off of websites. For instance, once I was thinking of doing a project about public health budgets for the different states. I noticed that some states had their public health budget online, and I thought, well, n equals 50 or so, you could just collect that data into a spreadsheet and do an analysis to see what you would find. It was just a thought. That's more like data abstraction, so we'll talk about that. But what data scientists usually do when they want data off of the internet and it's not available for download is scrape it. So we'll talk a little bit about the options for scraping and how that can fit into your portfolio. All right. So the first one we're going to try is shop for data in repositories.
Shopping for data in repositories is pretty easy, because the repositories are well organized, they have search engines, and their data sets are classified. Now, when they first started inventing these repositories, I was really skeptical, because I ran a lot of databases. Imagine you get a grant and you do data collection for a traditional epidemiologic study. What you end up with at the end is usually a data set in something like SAS format. It's the official analytic data set for the study, and it's usually sitting on some biostatistician's PC or out on a server somewhere. I would observe that and think, well, why aren't we sharing this? Why can't we easily share the study? But I hadn't really thought that far ahead. If you really want to share data from that study, it's a lot of work, because you've got to get the protocol together and you've got to get the study forms. You don't want somebody just downloading that data and not really knowing what happened. So when these repositories were set up, they set up rules that when you add data to the repository, you have to add certain documentation. It's very similar to making an R package and adding it to CRAN: you have to follow these rules, because that's how they keep the quality up. I remember I had a customer who had found an R package that was not on CRAN, but we used it anyway. And I realized why it wasn't on CRAN: it was doing something that's not statistically correct.
I even found a paper about exactly that little application. So there's a reason to go to the repository. First of all, you know the provenance of the data. You really know that it was from the CDC, or from whatever government put it there. Usually it's governments or universities. And these repositories are made for analysts to be able to just download the data and use it. But the problem is that much of the data is not very useful, for various reasons. Partly it's because other people have already used it. Think about the hypothetical I gave you: we did this clinical trial and I have this nice data set. Well, don't you think all my friends on the clinical trial have already written all the papers about it? So if you're doing scientific research, yeah, there are probably no new answers left in it. But if all you're trying to do is visualize patient data or come up with a patient data dashboard or something, maybe the data is just fine. It's real-life data that you can use for display. So it really matters what kind of project you're trying to do with it. All right, so let's see here. I put two on the slide. There's ICPSR, which actually stands for something. But it's like KFC: it used to be called Kentucky Fried Chicken, and now it's just KFC, and you're not supposed to remember that it's Kentucky Fried Chicken. ICPSR used to mean something, but you're not supposed to know anymore, which is good, because I can't remember. Below that I've got the CDC; this is the National Center for Health Statistics, but that's kind of complicated too. So I was going to show you shopping for data in ICPSR. Oh, one thing I want to say is that it's very hard to run these repositories. It's very labor intensive, and it's a lot of work to serve data.
The problem is, I think, that there's not much of a business model in it. The one we're going to look at here, ICPSR, is from the University of Michigan. And these others: these two are government ones, and this one I'll show you is kind of a weird one; maybe Mika knows more than I do about it. But here we go. Whenever I get to this web page I always swoon, because it looks like there's so much data. You can go find data, so we'll go find data. I'm going to show you how I would approach Abe's problem. Remember, Abe wants to find data sets about people in the US with type 2 diabetes so severe that they're using insulin, plus any information about insulin shortages or trouble getting insulin. So let's say I search for diabetes, and maybe United States. Okay. I get a lot of studies, and what's interesting is they have these coded by variables. They've done a lot of work to make it easy for me to shop for data. But nevertheless, you can see that these are really old data sets. Here's one: text message outreach for complex patients with diabetes in Denver. Well, think about it. It only goes until 2011 to 2012, which is kind of a long time ago but not so bad, and Denver is in the US. This looks like a group of people who did a study and posted their data set. I believe that if you write a grant, you can include funding to get your data together and posted here. Maybe that's how they were able to afford it, because that's not trivial. So we'll go to this data set and see if we can figure out who in the data set needed injectable insulin, and any other variables about them. I'll click on Variables here, and let's see what we've got.
It looks like they don't have their variables set up. Sometimes they do, because I've been clicking around here. They just didn't add that, so let's go to Data & Documentation. Here you can download, so let's download this data dictionary. Now that I'm actually doing this and getting this zip file, I feel like I want to keep track of it, because I'm actually looking at it. So I'm going to go over here, say this is for Abe, and copy this over. When you copy from the internet, it's really ugly, as you know from my previous stuff with HTML. For example, let's say I just copy this and paste. Look at how ugly that is; it's carrying all of that formatting information. So, undo. But if you go up to Paste and choose this one with the A, it's basically Paste Special, unformatted text, and it strips all of that. See? So that's what I'm doing there, if you're curious, whenever I copy from the internet. I'm going to copy this, and I'm not even going to bother to say that it's from ICPSR, because I can kind of see that from what I copied. Okay, so now I've downloaded a data dictionary. Where did it go? Oh look, it's only four pages. Let's see here, view, fit to page width. Okay, so as you can see, this is the standard material they had to create with the deposit, and here is the data dictionary. Now, if you notice, this isn't very easy. We'd probably have to read a lot more about this to understand it. So let's see here, I'm just going to search for insulin. Okay, that's not there. Okay, subject, time, date.
Looking at this, it seems they don't really have much information about the actual participants. It says culture here, and I don't know if that's like a throat culture or what. It says gender and age, but it really doesn't have anything else about them. I already have this, so I might as well save it; remember, it was in that zip file. So here I'm going to put: had a short data dictionary that did not say anything about insulin or much about the patients. So that one was no good. You can see how each time I get to something it's going to be kind of like this, but you start to iterate, and you start to get better at it. I was attracted to that one because it was sort of recent. This United States Renal Data System one, I think those people might be too sick. Let's see here. Census tract classification, no. Companion files for racial differences in patient experience and diabetes management outcomes among reproductive-age women. We don't even know if this is in the US, but let's take a look at it. So they have these papers, and sometimes you can tell from reading the paper what data they have. So this is from medical... no, this looks really complicated. And on top of that, this might be focused on gestational diabetes. Oh yeah, you're right, it might be gestational diabetes. Good call. That's another problem, and I don't know enough about gestational diabetes to know whether you ever end up injecting insulin. So we'll put this here and say: confusing, probably about gestational diabetes. Okay, so we don't want that. What I want to show you is that there are ones I found that are filled out. If it's the United States, let's see here, let's try this.
These are really old, but I want to show you one where they've really filled it out. Like here, I think. Let's see. No, they didn't fill this out. There are some where they filled it out. Let me see if I do this; I don't know if that'll help or not. Well, here, maybe multiple cause. I mean, this is 2005, but let's see if their variables are in here. Is nothing happening? Okay, here's Data & Documentation, and Variables. I guess it's loading. I'm surprised that I click on Variables and it loads like that. What that suggests to me is an I/O problem, because this is literally no data if you think about your bank account. And whenever I see an I/O problem like that loading data, I think SAS is the back end of the dashboard, and that's always a problem, because it means you've got a non-relational back end, and it's going to be hard to shop in this data. See this: age recode 12, age recode 27, age... These kinds of things are what you get when you have flat databases: a whole bunch of empty variables. So that's another problem with ICPSR and a lot of these repositories: they've got these flat, ugly data sets. I do want to show you what it looks like when the variables are actually filled out, though. Let's see. Are those variables or not? I saw some small ones. Oh, building infrastructure for comparative... BICEP. That's cute. Let's see what that is. Let's see here, Variables. Primary language, department, sitagliptin, immunosuppressants. Hey, it looks like there's some pretty good diabetes content here. This looks pretty good. Well, let's figure out what it's about. Okay, here's what this is. The primary aim was to advance analytical... of observational... with multiple... okay. In adults with type 2 diabetes mellitus coupled with additional chronic diseases.
Okay, these are sick people. What is the comparative effectiveness of T2DM medications in achieving glycemic control? Was the comparative effectiveness different across subgroups based on demographics, complex comorbidities, other types... So this suggests that we'd maybe have some data in here. Let's see if they filled in the variables. Oh yeah, here. Yeah, this looks pretty good. So this is one that I would actually look into more. I'm going to do that later during the workshop and report back on whether I could find a data set for Abe, or something Abe can do with this, or decide it's no good. All right. So hopefully what you see from ICPSR is that, even though it's a really good platform, with a lot of data on it that you could probably use in a portfolio project depending on what you're trying to do, and even though it's really organized, it's still very tedious to go shopping through it, keep track of everything you found, and decide what you're going to do. You might end up going back someday to some data sets you looked at and saying, okay, maybe I'll use one of these or one of those. But it's nice to know about the data that's out there. Now, ICPSR is at the University of Michigan. What does that mean? Whenever something is at an institution like a university, it means there are people there who really care about that project; it's their pet project. But here, this is the CDC and this is CMS. That's the government, so when the government is hosting a platform, it's more like this is part of its job. That's not to say there aren't some really good platforms from the government; BRFSS has a lot of really good documentation. It's just that often the institutions hosting a repository have a certain level of pride in it.
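Going back to the note-taking during the ICPSR search: one way to keep track of what you found and why you rejected it is a small log that sits next to the requirements. This is only a sketch using Python's standard library. The field names and the yes/no/unclear coding are my own assumptions, and the three entries paraphrase what was said out loud during the search rather than quoting the repository.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Candidate:
    """One data set looked at while shopping. Field names are hypothetical."""
    name: str
    source: str
    us_t2d_insulin_users: str   # requirement 1: US type 2 diabetics on insulin
    shortage_insight: str       # requirement 2: insight into insulin shortages
    years: str                  # requirement 3: the more recent the better
    decision: str
    reason: str


log = [
    Candidate("Text message outreach for complex patients with diabetes (Denver)",
              "ICPSR", "unclear", "no", "through 2011-2012", "rejected",
              "short data dictionary; nothing about insulin, little about patients"),
    Candidate("Companion files: racial differences in patient experience and "
              "diabetes management outcomes among reproductive-age women",
              "ICPSR", "unclear", "unknown", "unknown", "rejected",
              "confusing; probably about gestational diabetes"),
    Candidate("Comparative effectiveness of T2DM medications (title paraphrased)",
              "ICPSR", "possibly", "unknown", "unknown", "look into more",
              "variables are filled in and look relevant"),
]


def already_seen(entries, keyword):
    """True if any logged data set name contains the keyword (case-insensitive)."""
    return any(keyword.lower() in entry.name.lower() for entry in entries)


def to_csv(entries):
    """Render the log as CSV text so it can be pasted into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Candidate)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


print(already_seen(log, "text message outreach"))
print(to_csv(log))
```

The `already_seen` check is there for the situation mentioned earlier, where the same data set turns up again on a different platform and you want to know quickly that you already ruled it out.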
So I just found this public use data files and documentation. And this was not very easy to use. This was more like a bunch of pointers. Like here, if you go to NHANES, I'm a little familiar with NHANES. This is a surveillance data set, but it's super fragmented. Let's see if it's still fragmented. Yeah. So let's say you wanted to analyze this data. Let's say, you know, we were talking about lab data. Let's say that we wanted to download the lab data. Actually, it's probably available from, you know, that's funny, pre-pandemic. They're so pragmatic. But anyway, let's say that you wanted to take the lab data, you know, here. So each person who was in NHANES got this test done to them. Right. Well, not every single one, each one of them that participated in the lab part. Right. So this could be data that you could download, just this lab data, if you don't care whose labs these were. You could just use it to make scatter plots and stuff. But normally you actually care whose labs they are. And so you would have to connect the lab data with the demographic data to figure out, like, the race of everybody. And if you wanted to see, like, I'm seeing this albumin creatinine, like, this is bad. Like if you've got albumin and creatinine in your urine, you're probably having kidney problems. So I would want to know about maybe dietary data or whatever. But the problem is, once you go and you hook all these together, all the people didn't participate in everything, and you end up with these really small data sets. Like, some of them that gave the labs, you know, they don't have the information about diet. It's so frustrating. So the CDC one is basically another rabbit hole. You have to go and study all these data sets. And again, I'm concentrating on health data because that's what we do, but you can, you know, find other data out there like that. And then CMS, this.
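The NHANES linking problem described above, where each added component file shrinks the usable sample, can be illustrated with a small sketch. This is only an illustration under stated assumptions: the three tables, the respondent IDs, and every value in them are invented, and the field names (`age`, `urine_albumin`, `sodium_mg`) are hypothetical rather than real NHANES variable names.

```python
# Hypothetical miniature tables keyed by a respondent ID. All values are invented.
demographics = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 47},
                4: {"age": 62}, 5: {"age": 29}}
labs = {1: {"urine_albumin": 12.0}, 2: {"urine_albumin": 40.5},
        4: {"urine_albumin": 8.1}}
diet = {2: {"sodium_mg": 3100}, 3: {"sodium_mg": 2700}, 5: {"sodium_mg": 3500}}


def inner_join(*tables):
    """Keep only respondents who appear in every table, merging their fields."""
    shared_ids = set(tables[0])
    for table in tables[1:]:
        shared_ids &= set(table)
    merged = {}
    for respondent_id in sorted(shared_ids):
        row = {}
        for table in tables:
            row.update(table[respondent_id])
        merged[respondent_id] = row
    return merged


demo_labs = inner_join(demographics, labs)
all_three = inner_join(demographics, labs, diet)

# Report how the sample shrinks as each component is linked in.
print("demographics only:", len(demographics))
print("demographics + labs:", len(demo_labs))
print("demographics + labs + diet:", len(all_three))
```

Counting the rows that survive each join before committing to an analysis plan is a cheap way to find out whether a fragmented data set will leave enough people to work with.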
So when Obama was president is when a lot of the rules were created for this administrative data, that you have to have it available, and you have to have places like this. And you know, I went on here and I did explore data. Let's see here, pharmacy. Oops, I better spell it right. Sorry, no matches. So, Mika, maybe you can tell me what you did. I started, I literally started with Google, Google search, because I know CMS is sort of like a maze. Their website is always a big maze to me. I didn't want to. Okay, no, that sounds right, and actually, I'm going to show you. So Google tells you if you say pharmacy claims data, let's say I do that. I can do site:cms.gov. I don't know if that's really going to help, but it's supposed to. Yeah, so CMS, that's Part D. So this Part D claims data, maybe that's what it looks like. Is this where you went? Let's see here. Part D claims data. So these downloads are PDFs, right? Like, these aren't actual data. Actually, if you go that route, eventually you end up going to ResDAC. Yeah, actually, I found this ResDAC. You know, ResDAC was at our place at the University of Minnesota, do you remember that? No, I don't. But usually they ask for, yeah, submissions and permissions and agreements and all kinds of stuff. Yeah, I don't particularly like it because it's cumbersome. So cumbersome to just go through and navigate through all the paperwork. So there is another version of CMS data called public use data. I can share my screen if you like. Oh yeah, why don't you do it. And yeah, so that's where the CMS is. Okay, CMS, and initially I got into this page. I see. So you managed to find, like, the magic page. Yeah, the public use file. Yeah, they really bury them, don't they? Oh, they bury it. They don't really come up easily. But did you actually get to the point of downloading the data, to make sure that it's really right? Yeah, if you click, for example, this one, and view data. I see.
And so see how it's got this application where you're supposed to select what you want and then you can export it, like which columns. Yeah, yeah, actually this one is great. So there's prescriber NPI, that's prescriber information, then their name, last or organization name, first name, state abbreviation. Okay, so let's talk about it. What you would notice is that this is page one of 2,500,000, right? So this is where you could start with this. Let's say that you wanted to use these data for your portfolio project. What I would do is I would first figure out what all the column names are, what they mean, and which ones you actually want. Like, I create a data dictionary in Excel. And I'd figure out all the column names; you might be able to download them or something like that. Yeah, or I just put all the column names in order, and you know how I do it, one, two, three, four, five, and then next to each one I would just write notes about whether I wanted it or not. Right. And then after that, see that filter, where it says manage columns and filters? So this is my fantasy: manage columns is where you get to pick which columns you want to show, right? So after I had that data dictionary, I'd go up here and I'd select the columns, and I'd go cherry-pick out the columns I wanted. And that would be great. Okay, then the next step is in my data dictionary. I'd say, okay, what rows do I want, like inclusion and exclusion criteria, you know what I mean. And so, like, we were looking at that Zafari article. So what happened in the Zafari article? Well, they were analyzing claims for the state of New Hampshire, so that would be like prescriber state on there equals NH, right, N-H, right. So if you wanted to do NH, then you could do that, or I would recommend you choose a state. Okay, because regionally it just makes more sense, okay.
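The data dictionary workflow described above (list every column, note whether to keep it, then write down row inclusion criteria before exporting) could be sketched roughly as follows. This is a minimal sketch, not the portal's actual export format: the column names, the sample rows, and the helper `apply_dictionary` are all hypothetical, and in practice the dictionary would more likely live in a spreadsheet.

```python
import csv
import io

# Hypothetical stand-in for a portal export. Column names and rows are invented.
SAMPLE_EXPORT = """prescriber_npi,prescriber_last_org_name,prescriber_city,prescriber_state,prescriber_type,generic_name
1000000001,Smith,Concord,NH,Internal Medicine,Metformin
1000000002,Jones,Boston,MA,Pain Management,Oxycodone
1000000003,Lee,Nashua,NH,Pain Management,Oxycodone
"""

# The data dictionary: one entry per column, with a keep decision and a note.
data_dictionary = [
    {"column": "prescriber_npi", "keep": True, "note": "needed to count prescribers"},
    {"column": "prescriber_last_org_name", "keep": False, "note": "not needed"},
    {"column": "prescriber_city", "keep": False, "note": "state is enough"},
    {"column": "prescriber_state", "keep": True, "note": "inclusion criterion"},
    {"column": "prescriber_type", "keep": True, "note": "comparison groups, check values"},
    {"column": "generic_name", "keep": True, "note": "drug of interest"},
]

# Row inclusion criteria, written down so the export can be replicated.
inclusion_criteria = {"prescriber_state": {"NH"}}


def apply_dictionary(csv_text, dictionary, criteria):
    """Keep only the columns marked keep=True and the rows meeting every criterion."""
    keep_columns = [entry["column"] for entry in dictionary if entry["keep"]]
    selected_rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all(row[column] in allowed for column, allowed in criteria.items()):
            selected_rows.append({column: row[column] for column in keep_columns})
    return selected_rows


selected = apply_dictionary(SAMPLE_EXPORT, data_dictionary, inclusion_criteria)
for row in selected:
    print(row)
```

Because the keep decisions and the criteria are plain data, changing your mind means editing the dictionary and rerunning, and the same dictionary doubles as the raw material for a methods section.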
Now remember how Zafari said, well, we want to compare the pain management prescribers versus, like, the cardiology, or I forgot what they said, it wasn't cardiology, it was something else, prescribers. So, like here, you have prescriber type or whatever. So the problem with this is we don't really have the data dictionary made, right? So let's say you were making your data dictionary and it said prescriber NPI, prescriber last organization name, prescriber first name, prescriber city. You'd just be like, okay, well, maybe I don't need the city, I don't care what the city is, right? But under prescriber type you'd be like, oh, what are the types, what are the choices, right? I see internal medicine, but we're going to have to do some digging in our data dictionary and figure out what the choices are. So that's where you would go, okay, let's say I'm going to filter by prescriber type. You can make, you know, a tab, a pick list tab, and figure out what the values are in there, and then make a choice to say, okay, I'm going to just pull in these. You put it in your data dictionary, and you know, you can change your mind, right? Now here, it's great, so the drug brand name, drug generic name. Yeah, you really did a good job of doing this, but one of the things, you know, what I'm basically coaching on is this. This is a little off topic, but it's kind of not off topic, okay. The government doesn't want you to have its data. Okay, so what do I mean? I mean all over the world. Right now there's this movement, especially in democracies, towards passing laws that governments have to open their data. So there's laws being passed in the Netherlands and laws being passed in the UK and laws being passed in the United States, where the government's supposed to open its data, make data public, right? Well, there are two main complaints, and this is evidence based, from the government against making it public.
The first complaint is it's too much work, and you kind of have to agree with them, because this portal is amazing, right, and it was a lot of work. But the second thing that they say is something very annoying, and that is that they don't want oversight. That's basically what they're saying: they don't want you to actually get to the data and analyze it and try to hold the government accountable. So what ends up happening is the government's like, okay, we have to serve up the data by law, so let's do a crappy job. Well, then the law comes back and says you're not allowed to do a crappy job. Like, usually the first response will be, okay, we have the data but it's in a PDF. No, no, no, you have to make it an Excel. Okay, we have it, but you know, it's buried or whatever. And one of the strategies that they've used is to create these portals. Oh, well, you can pick any data you want to download, you just have to manage the columns and put the filter on. Well, that's going to take you a very long time, to sit down with the data dictionary, look at everything, and decide what to download. Right. Like, wouldn't it be easier if they just gave you a SAS data set and you just threw it in SAS, or they gave you, you know, a CSV and you just throw it in Python, or you just throw it in some profiling program, you know what I'm saying. And so, on one hand, I really like it that they created this. So you could make a data dictionary. So you could write all this down. So you could replicate it, so you could write a methods section. But they did make it, like, as onerous as possible. Like, can you get even more to display per page? Like, can you get it to be less than 2 million pages? See where it says, yeah. Okay, put 100, that'll help. Then we can see more, you know, when we're in that. Okay, well, that didn't work, did it? Oh, he still has to scroll down. All right.
So, do you know what I'm saying, Mika, about making this data dictionary, where you decide in the data dictionary what columns you want? And you decide also sort of the inclusion criteria. It's like you write that down, you play with this database, and you write all that stuff down before you actually set up that manage columns and the filter and do the export. So you get what I'm saying. And so, yeah, I can probably select that, so pick a state like New Hampshire or California or something like that. Yeah, but do you get what I'm saying about making a data dictionary? It's like you sit down and you open it up and you don't know what to say, so you write some stuff down, and then you play with the data, and then you change your mind. You know how in BRFSS they have a code book? So if you look at the state one, it'll say how many are in Illinois and how many are in New Hampshire, whatever. Well, I don't know if there's a code book for this, right? So you're pretty much going to be like, oh, okay, well, let's say I do put a filter on California, like how many are in there, do I really want California, you know what I mean. So I would recommend shopping for columns first and deciding what columns you even care about. And it's kind of like, do it the way you do when you're shopping for a dress, where you just go grab everything you think is going to fit, you know, take it in the dressing room and try everything on once and throw it all out. So just grab any columns from manage columns that you think, or not grab them, but put them in your data dictionary with, like, an X, like, I think this is a candidate column. And then, when you go to decide the inclusion and exclusion criteria, like what rows you're going to accept, what values those columns have to have, well, you know, obviously you don't care about prescriber NPI, or you don't care about some of those, you know, I don't care what the value is.
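Where a portal has no code book, the "how many are in California, do I really want California" question above can be answered by tallying the values yourself. The following is a rough sketch of a homemade frequency table under stated assumptions: the rows are invented, and the column names `prescriber_state` and `prescriber_type` are hypothetical placeholders for whatever the real export calls them.

```python
from collections import Counter

# Invented rows standing in for a slice of an exported file.
rows = [
    {"prescriber_state": "NH", "prescriber_type": "Internal Medicine"},
    {"prescriber_state": "NH", "prescriber_type": "Pain Management"},
    {"prescriber_state": "CA", "prescriber_type": "Internal Medicine"},
    {"prescriber_state": "CA", "prescriber_type": "Internal Medicine"},
    {"prescriber_state": "CA", "prescriber_type": "Family Practice"},
    {"prescriber_state": "IL", "prescriber_type": "Pain Management"},
]


def frequency_table(records, column):
    """Count how many records take each value of a column, like a code book entry."""
    return Counter(record[column] for record in records)


state_counts = frequency_table(rows, "prescriber_state")
type_counts = frequency_table(rows, "prescriber_type")

# Print values from most to least common, ready to paste into the data dictionary.
for value, count in state_counts.most_common():
    print(f"prescriber_state = {value}: {count}")
for value, count in type_counts.most_common():
    print(f"prescriber_type = {value}: {count}")
```

Copying these counts next to each candidate column in the data dictionary records what was known at the moment the inclusion criteria were chosen.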
But you care what the value is of state, and maybe you care, I don't know if you care what the... Oh, I see that FIPS code. That's easier. That's the same as the state, like 25 is Massachusetts, and it's easier to set the criteria because it's just a number. I just noticed that. But like, maybe county, county has a FIPS code on it too. You might end up taking a big county, like Los Angeles County, I don't know. But you see what I'm saying: as you play with the data on this portal, and you see what values are in the fields, then you can write down what inclusion and exclusion criteria you really want to apply. We want to keep track of all that, because then you'll set it, and you'll do that export, and you'll grab the data. And I don't know, I keep paying my taxes, you probably are paying yours, so this will keep getting updated, and you can go and maybe get more data, like if you ever want to replicate it or whatever. So you get what I'm saying about that data dictionary, like using that to help you nail down your criteria for your columns. Okay, great. Well, good job, great job, like, what a find. All right, well, I was going to give us an activity to look for a repository, but you already found your repository, so I think we'll just go on with that, because you did such a good job. All right. So just to wrap up that part: it can be hard to find a repository. I wanted one more repository, okay, so this thing over here. Thank you. Okay, I meant to be showing you this. So see this, ghdx.healthdata.org. What this is, I kind of don't understand it, but it has something to do with the Lancet, and to do with this IHME, which is a think tank, a healthcare think tank at the University of Washington in the US. So it's something to do with the Lancet journal in the UK and them. And I was intrigued by this because I thought you could download data from it. So, search data.
This is about, like, I actually have some trouble understanding what it's about, like these IHME data sets. We are supposed to be able to download. Let's see here, maybe this is the tool. Okay. Yeah, this was a query tool, and I could not figure out how to query this. I had a customer who was from a particular country in a region of the world, and she wanted to compare these statistics between the surrounding countries and her country, and we worked with this for so long and we could not get the data out. But I'm noticing this results tool, I don't even know how it works. So you'll end up with situations like that, where you just cannot get the data out. I'll look it up later, but there are some of these query tools and they're just so hard to use. Now, there's another repository I want to talk about, but it's not one that's very easy to access, okay, and that's the Military Health System Data Repository. So if you're hanging out with the military, like I kind of do, you'll have people call it MDR or MHS. And sometimes I call it MDS because I get confused, or it's kind of like military data system. So, this Military Health System Data Repository thing, the military's been doing that for years, like literally since the 70s and 80s. And part of the reason why it's possible for the military to have done that is that military treatment facilities are not the same as, you know, like, if you're on a base, like Fort Bragg, and you get sick, and you go there, those places are not like any other places in the US healthcare system. You know, you can't just have insurance and go there.
I mean, I've been there because I was at work, and we were visiting, we were interviewing people at those clinics. I didn't get sick and go to one, but a lot of people get sick and go to them. You know, they're very functional clinics. It's just that what's interesting about them is they don't have the usual claims that you get, but they have their own data, and that's what's in here. So, you can access this data, but you have to get permission. And you're probably like, oh my god, how is that, but actually I don't think it's that hard to get permission. This data dictionary here is a work of art. Okay, and I remember talking, I don't know if the same person's working there, but the data dictionary this person invented is a work of art. Now, I wouldn't do this, modernly, because this involves a lot of macros in Excel. See how I'm enabling the macros, because I know it was made by the military. But see here, these are the detail files. See this, detail files and reference tables. These are basically the pick lists. And these are the main tables. And I'm really familiar with a lot of these data sets, like there's one called SIDR, which is an inpatient data set, and SADR is an ambulatory data set. So let's say I click on the SIDR. It goes right there, isn't that beautiful? And then here's this beautiful data dictionary. So the reason why I even bring this up is that if you actually did a good job of shopping that data dictionary, and you really did not need much data from this, especially if you didn't need identifiers, it might not be that hard to actually get this data and do an analysis, as long as you wrote a protocol and stuff. So I did want to bring up health.mil. All right. So when you're scrounging around, and you're looking for data on the internet, and Mika, you did a really good job, let's say that I'm looking for, oh, we'll do like with Abe, we'll do, like, diab...
One thing I like to do is I like to include the word download, so download diabetes. Oh, it says diabetes.csv. Okay, see how Google filled that in for me? That made me think, okay, well, a lot of people are doing that, so let me go see. The problem is, a lot of times, what I was kind of looking for in some of those, I can't find if they were using insulin or not. Let me see here. Or if it comes up for insulin, it's like there's an insulin measurement, but I don't care, I want to know if they're using exogenous insulin. So let's see here. So here, this is University of California Irvine. So it makes me wonder, maybe I can just get their data set, right? So, um, this data set is from AIM '94, which suggests it's really old. But again, you know, this does say, like, regular insulin dose. Okay, let's see here. So diabetes patient records were obtained from two sources, an automatic electronic recording device and paper records, yeah. And so this isn't, you know, very, like, that's something you could use to practice an algorithm or put in a dashboard, but it's not something that you could, you know, really analyze. So I found this data.world, which is kind of this new repository, but I don't know if it's any good, because it looks like anybody can do anything at data.world. Let's see here. Oh, this is the same data set I was just looking at. It says here there are 36 diabetics. Well, if anybody's watching this, Mika, do you know anything about data.world? I've just looked around a little bit. There are lots of things in there, and it's hard to find what I need. You know, I think it's an aggregator. I think it just trawls around on the internet and copies stuff in automatically. I think so. You get a lot of that stuff, you know, like if I go, like, I was looking for best yarn store in Boston. I'll get stuff like this, I'll get Yelp, the top 10 best stores in Boston, and this is just an aggregator.
This is not like an article in Boston magazine about the best yarn stores, you know, and that's what I think that is, an aggregator. So I don't know, I'm not a big fan. But anyway, this is the second thing, where you just scrounge around on the internet and you look for data sets. Okay. What are you going to find? You might find hosted data sets, like that one at the University of California Irvine, where you just download it. And you might find something like what I found at health.mil, where you have to ask for permission. But before you ask for permission, just like with ResDAC, like we were talking about before, you really need to write a little protocol and have kind of a good plan. Like, what Mika is trying to do with journaling is probably well enough planned so that if, unluckily, she had to request data, she'd have enough to write down a protocol, and then be able to, you know, make a data request and explain what variables she really wanted and which ones she didn't need, and, you know, actually just sign up for that stuff. You know, it's funny, like, Mika, when you were going to ResDAC, you're like, oh, you have to file this paperwork, you have to do all that. It's not everybody's response, but one of the things that I find sort of interesting is, you know how you found that awesome web page where you can just select the data and download it? But I just told you you're going to have to do hundreds of hours of work on the data dictionary before you really know what you want and what your data is and what you're going to download. So that's the same work you have to do when you make a data use agreement, right? And so you end up kind of doing the same things.
Like, ResDAC is just like the military, like everywhere else that you have to apply for data: you end up doing almost the same things if you actually file the paperwork. So, when I explain that to people, then they feel trapped. You just can't get out of it. You just can't get out of the paperwork. And the reason why you can't get out of the paperwork is because some of the paperwork is just about replicability. It's just about the methods section for whatever your methods are going to be, because you're going to want to be able to tell people, this is where I got my data, you know, this is what I filtered in, this is why. And also, in data science projects, sometimes you start by filtering in New Hampshire and trying something. And then you realize there's something wrong with New Hampshire, like it has no surgeons or something like that. I mean, it obviously has surgeons, but maybe the data doesn't have any surgeons or something like that. And then you end up having to do another thing, and you can write a blog post about any of these things, you know, that you find. That's what I kind of like about the data science journey, that it's more about the journey. Like, in science it's more about the end game, but in data science it's about the journey. Okay, so Mika, any questions before I move on to make your own data? All right. So let's say Mika succeeded. We've got stuff for her. I'm going to help Beth look for some stuff for her, and we're probably going to find some, you know, I saw some labs in there. For my friend Abe, we'll look for some stuff for him. But sometimes you can't find what you want. Or you kind of do find what you want, but it's not in the shape you want. So the two choices you have when that happens are data abstraction and data scraping.
Now, I'm not including in make your own data interviews or surveys or anything, because that's IRB stuff, right, like you would have to deal with ethics. Now, I have a business. So if I send out a market research survey that says, you know, what classes do you want, or what courses do you want me to make, or whatever, that's not an IRB thing if I keep it, you know, anonymous. But if you start actually doing surveys of how people feel about things, you know, that's an IRB thing. So I'm separating that from this. I'm talking about stuff you can do that's non-IRB, that's not going to be considered human research. And data abstraction is actually the oldest game in the world. It's the game we used to play back in the 80s and 90s. So, I actually never did an abstraction. I shouldn't say that. I've probably done small ones, but I've never done a big abstraction project. But I remember an abstraction team doing one. So what had happened was, I think it was at the county, there was some question about, I think it was gestational diabetes. I remember this doctor had kept a list of patients who had gotten gestational diabetes, and she was worried about them, and she wanted to do follow-up. So this list had the patient ID, so they could write the protocol, and the protocol said that they would look up the patient ID. And what they did was they created a form that the data abstractor would fill out, and they created instructions on where to find the data in the medical record, you know, they were paper records and they were organized a certain way, so where to look for the data to put in each field. And then, you know, on a special day they were allowed into the medical records room, and the records had been pulled for them, and they sat down with each one and they would fill out that form.
And then they'd bring out the forms, and then somebody would do data entry and turn it into data. Well, how that would change today is, like, I'll give you an example of one that I was interested in. This was about 10 years ago. I was really interested in the difference in the budgets of the different public health departments in the different states. So if you assume 50 states, which is not right because there's Guam, there's other territories and stuff, but if you assume 50 states, you would have 50 experimental units. It's kind of like 50 diabetics, right, that you're abstracting data from. So I realized that most public health departments have their budget online. Most of them have a web page, and they have a lot of information about them. So I could structure it like a protocol, where I went and I looked up all 50 of these and I gathered data about them, and that would be a nice abstraction project. So I want to make sure everybody understands that abstraction projects in the public domain are really useful. You might not have a lot of, like, there are not actually that many hospitals in Massachusetts, because it's not that big. So you can do a lot by just visualizing hospitals, or even visualizing pharmacies. Actually, I'm going to go over here, because there's this thing called HPSA, which are health professional shortage areas. And there's actually this HPSA database that you can study. And again, the experimental unit is not people, it's like territory or state. And this would take a while. I was helping somebody with something in North Dakota, so I was trying to figure out, well, like, select counties, I guess all counties. So I was trying to figure out what these shortage areas are, right? And so this is, is it Billings County, I mean, the whole county's a shortage area. So I would have had to really kind of study this data before analyzing it.
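The abstraction approach described above (a protocol, a fixed form filled out once per experimental unit, then data entry) could be sketched for the state health department budget idea roughly as follows. This is only a sketch under stated assumptions: the form fields, the `validate` rules, the example.org URLs, and every dollar figure are invented for illustration and are not real budget data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BudgetAbstraction:
    """One filled-out abstraction form: one row per state (the experimental unit)."""
    state: str               # two-letter postal abbreviation
    source_url: str          # where the abstractor found the figure
    fiscal_year: int
    total_budget_usd: float
    date_abstracted: str     # ISO date, so the lookup can be replicated
    abstractor_notes: str = ""


def validate(record):
    """Return a list of problems with a form; an empty list means it passes."""
    problems = []
    if len(record.state) != 2 or not record.state.isupper():
        problems.append("state must be a two-letter uppercase abbreviation")
    if not record.source_url.startswith("http"):
        problems.append("source_url must be a web address")
    if record.total_budget_usd <= 0:
        problems.append("total_budget_usd must be positive")
    return problems


def to_csv(records):
    """Turn the filled-out forms into CSV text, one row per form."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BudgetAbstraction)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Invented example forms. The URLs and dollar amounts are placeholders.
forms = [
    BudgetAbstraction("MA", "https://example.org/ma-budget", 2023, 1.0e9, "2024-01-15"),
    BudgetAbstraction("ND", "https://example.org/nd-budget", 2023, 2.0e8, "2024-01-15",
                      "figure taken from a PDF summary table"),
]

for form in forms:
    assert validate(form) == [], validate(form)
csv_text = to_csv(forms)
print(csv_text)
```

Recording the source and the abstraction date on every form is what keeps a hand-built data set replicable.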
But you can, and like, if you don't like the data in there, you can actually add your own. Like, you can look up Bottineau County or Billings County or Bowman County and add data to that, like what is their median income and things like that. And especially if you're just using a small set, like just the hospitals in Massachusetts, that's not that big, you can look up a bunch of stuff about each of the hospitals and just create your own Excel data set. And again, you have to keep track of everything you did, or else it won't be replicable. But that's data abstraction. And the reason why I'm a big fan of data abstraction is because I'm a big fan of measurement, and I'm a big fan of human brains and not artificial intelligence, because when you sit down and you create structured data collection that has to do with some research aim, you're probably going to do just a really good job of analyzing it. I mean, I give that example of that casino paper I did. I could come up with some sort of reasonable comparisons between these casinos, and I'm not this huge casino expert. And it all really had to do with, they had just a little bit of data online, and it was in a PDF. It wasn't a table, but it was in a PDF, and I just did data entry. And I looked up other stuff about the casino online, like what square feet it had and stuff. And so I'm a huge fan of data abstraction and data entry and all that. But I'm the only one in the whole world, I'm pretty sure, because what everybody else likes to do is scrape the data. Okay, so now we're going to move on to data scraping. So just to be clear what I mean by data scraping, I'll show you an example. You can probably imagine that all I do is think about all these projects I wish I could do, and I never have time, because, I don't know, or maybe I just don't have the political will.
I was thinking, what if I went, I live in Boston but I grew up in Minneapolis, what if I went to Minneapolis and I stayed in a hotel instead of staying with my relatives, right? I don't know, I just thought of that. I think I was asking that because I had to go when my dad was going to be out of town or something, but I don't know, I once had this question. And I searched for Minneapolis hotels on TripAdvisor thinking that I would get some good ideas, but what I realized is I had a lot of trouble comparing them, because I actually kind of knew them. Right, like here, I'm sorry, I keep getting this. So here is this Holiday Inn Express and Suites in downtown Minneapolis, and I can't remember exactly where this is, but it's got, like, four dots, and I kind of have a recollection this is not that great of a hotel. I'd never heard of this Hewing Hotel. And, let's see, there was one that came up, it was this Normandy, here, this one, the Best Western. And I remembered this one. Right. I'm going to laugh while I remember it. I remember it because in the movie Fargo, that murder takes place in the parking lot of this building. But anyway, so I remember this Normandy Suites, and I'm like, this is kind of an old hotel. Like, if you look at it, I think it's kind of cool looking, I like that. But I was like, well, this is, like, four and a half, Best Western, is this really that good? If you look here, the overall rating is 4.5. But each of these has its own, like 4.5, 4.6, 4.5, 4.3. And I've always sort of wondered how they combine these to make this. And I'm assuming it's sort of weighted, you know, because each person down here, I guess you can read this. My point is, this sort of cool, old-fashioned, sort of 60s-looking hotel is apparently pretty highly rated. It's four and a half dots. And so is, you know, this Hewing Hotel, which is totally new, right, and I don't know anything about it.
But I was like, well, the problem with all this is that I kind of wish I could compare these things. Right. Like, this value is low on this one. Like, this is not good value. Where's that other one? I think the other one was good value. So I had to, like, the value, 4.3. But you see how I have to toggle back and forth to do all that? And it's annoying. And what if I had the hotels, like, you'll notice these are in downtown Minneapolis, and downtown Minneapolis is like a grid. So I could draw on that grid exactly where I wanted this hotel to be. And so I could scrape the data. I could set up a situation like that because, what you'll notice is that this is a standardized page. And how this is displaying is, there's a database behind it. And, like, Hewing Hotel is in the header, and it knows to put it here. And it knows to put this image of four and a half dots here. It's got some value, and it runs some sort of routine to know which image to display there. And it basically creates all these labels and places them based on the underlying data. Okay. So of course, if you're like me, I wish I could just break into TripAdvisor's database and just download the data, right, but you can't, so then you're scraping it, right? And when you scrape stuff, which I haven't done, but I worked on a project with Natasha where she did the scraping, basically, first of all, you cannot scrape unless you have a situation like this. You're basically reverse engineering a database report. So this report, you know, when I clicked on, let's say I click on this Ivy hotel, which I think was under construction when I was there, I don't remember. It's been so long. But let's say here that I do this Ivy, oh, I guess it's a collection here. I have to click on it. So, let me see if I can find this. Well, no, now I'm on Expedia. See, this is a problem.
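The idea above, that a standardized page is a database report whose labels can be reverse engineered back into fields, could be sketched with Python's built-in `html.parser`. This is a minimal sketch under loud assumptions: the HTML below is an invented sample, not any real site's markup, the class names `hotel-name` and `rating` and the `data-label` attribute are hypothetical, and a real project would need to check the site's terms of use before scraping anything.

```python
from html.parser import HTMLParser

# Invented sample page. Real pages have different and far messier markup.
SAMPLE_PAGE = """
<html><body>
  <h1 class="hotel-name">Example Hotel</h1>
  <span class="rating" data-label="overall">4.5</span>
  <span class="rating" data-label="value">4.3</span>
  <span class="rating" data-label="location">4.6</span>
</body></html>
"""


class HotelPageParser(HTMLParser):
    """Collect the hotel name and each labeled rating from one standardized page."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # the field the next piece of text belongs to, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "h1" and attributes.get("class") == "hotel-name":
            self._field = "name"
        elif tag == "span" and attributes.get("class") == "rating":
            self._field = "rating_" + attributes.get("data-label", "unknown")

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            # The name stays text; every rating is converted to a number.
            self.record[self._field] = text if self._field == "name" else float(text)
            self._field = None

    def handle_endtag(self, tag):
        self._field = None


def scrape(page_html):
    """Parse one page's HTML into a flat record, one key per displayed field."""
    parser = HotelPageParser()
    parser.feed(page_html)
    return parser.record


hotel = scrape(SAMPLE_PAGE)
print(hotel)
```

Running the same parser over each hotel's page would give one row per hotel, which is what makes the side-by-side comparison of value ratings possible without toggling between pages. A rating shown only as an image, like the dots, would not be captured this way.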
So, you pick this property on TripAdvisor. Okay, I don't want to be on Expedia, because if I go on Expedia, it's going to be different. It's going to have a different location on the page for all these things. Let's see here. So this says "View deal, Expedia." Maybe I should have just clicked here. I don't even know how to use this website, I guess. Well, that's it, I clicked in the wrong place. That just shows you I obviously don't know what I'm doing. So here. Probably what would have to happen is I would have to open all of these pages. And I would have to program whatever the scraping tool is to tell it what to scrape, like to tell it to scrape this, and maybe scrape this. And you see it's going to be hard to scrape something like an image, right? Let's see here. I'm clicking and saying "open image in new tab." Oh, it's not. So a lot of times, web pages like this that are very professional will lock their images down. So this could then be a problem. Let's say that I wanted to scrape this. I could probably get this scraped. But as you can see, it is really hard to reverse engineer a database report displayed on the web to scrape the data out of it. But there are reasons to do it. Now, if I were to do what I was just describing, scrape some data from TripAdvisor so I can compare some Minneapolis hotels, and I made a little portfolio project about it, TripAdvisor probably wouldn't have any problem with that. But if I went and scraped their whole database, and I started offering it, like you can buy TripAdvisor's database from me, they'd probably care about that. So right now we're at a point where it's not really clear what the rules are for scraping data. Because the other thing that could happen is I could just look these up, let's say I look up 10 of them, and do data entry. Is that illegal?
I mean, no, probably not, right? When you think about it, all of the data that TripAdvisor is putting in there is publicly available, you know, how many rooms they have, when it was built. So you start asking the question: when is it that I put together public data and it's unethical, like I'm basically stealing someone else's data? Or think about the situation I just described, where you're abstracting data according to some protocol. You're basically creating a data set if you do that. You could sell that. I mean, people do. There are research organizations where all they do is stuff like that. They basically say, I'm going to collect data about all the new course management system platforms. And the reason why we don't hear about those places a lot is that you have to pay them for their report. Like, I was helping a customer or something, and we were reading, oh, this was it, it was that insulin thing. There was a report out with some calculation about how people in North Dakota who needed insulin couldn't get it. And when I went to read it, I was like, oh, this is a peer-reviewed article. Well, it wasn't. It was a market report that had been commissioned. So it's not a peer-reviewed article, and you can't get it. But that's what those companies do. They'll collect data from the public domain and put it together and then sell it. And that's legal. I mean, it's a business. But if you go to steal TripAdvisor's data by back-engineering their entire database, that's bad. Now, remember the thing I told you about open government data: we make these laws that the government is supposed to put together this data that we all can download and analyze, and they usually comply, but sometimes they comply in a way that is very hard to use, right?
And that's actually what spurred this project that Natasha and I did. So this is a dashboard Natasha made from data we scraped. And you might be like, oh, is that illegal? Well, actually it's not, because the data we scraped is, we believe, from NHSN, which is our data. It's the data of the United States. NHSN is the National Healthcare Safety Network, and this data is kind of like CMS and the things we were just looking at, right? But unlike those, I can't download the data. They don't let you download it. And remember, this is about healthcare facilities. It's not even about individuals, like people, where you would have to worry about redacting it. No, it's about healthcare facilities, and it's not even about all of them. It's about the ones that report. And so I learned that there was a law in Massachusetts that they had to make this data available. They had to make nosocomial infection data available so, I guess, people like me could shop around and try to stay alive by choosing the hospital. But what ended up happening is, okay, first of all, I can't download this data from the feds. So I want to download it from Massachusetts, and I'm trying to see that here. These are the healthcare infection reports, so you go here. And here, see this HAI interactive map. So this is a PowerPoint, and this is a document. That's not an Excel file, right? So that's not the data. The data is not here. So we go to this interactive map, and we have this. And I sat with this for a very long time and tried to figure out what it meant. As you can see, there are just not many hospitals in Massachusetts. And if you stratify them and you show these kinds of things, this doesn't tell me anything. I have no idea how to compare these. And see, they separated all of these. This is CLABSI, this is CAUTI. There's no way to tell anything. There's no way to compare these.
And so, here, now you can compare them. We just looked at basically the raw rates. We're missing a lot of data, so this isn't very accurate, because the NHSN is just really inaccurate. But this at least lets you do some sort of comparison. I would prefer to have the raw data and do a better job of saying which ones were lacking data or whatever. But this is what you can do when you scrape data, and I felt okay scraping the data because it's our data, it's NHSN. And so that was totally annoying. So, the problem with scraping data is it can be logistically challenging and, like I said, it can be illegal. Like, if you go to ahd.com, this is a really good database. It's the American Hospital Directory, and it's a for-profit company. What happens is, every quarter, hospitals that receive payment from Medicare have to submit data about the hospital to the government. You can buy that data raw from the government, or you can pay ahd.com and log in, and they've got it all set up so you can download the data and analyze it. It's all hooked up together. It's really nice. The reason why I haven't ever really gotten into this is because it's mostly used by people doing cost analysis, because what's really good in there is looking at costs of procedures across the US. They do have these free state and national stats. And when I was teaching my statistics class, I often used this data. So first of all, I know the data about hospitals is public data, there's nothing private in here. But I would go in and I'd say, okay, here's Massachusetts, and I'd be teaching them. For instance, I was teaching sampling, so I said, okay, why don't you do systematic sampling of this. Well, you know, say I take information like this from American Hospital Directory for some of these hospitals.
And I look at it. Like, if you look at each of these, like this little hospital in Westfield, you'll see it's one of these reports, right? And it's got all this. You know, this is just not something you should scrape, and they also say it's illegal. If they find that you're scraping their data set, they'll shut you down. But I really just don't think they'd get mad at my statistics class, which would go in and ask a few questions and look at a little bit of data about the hospitals they're familiar with. And so, right now, I don't know what the line is. I don't know if they'd get mad if I looked up a bunch of data on there. I know if you do a bunch of queries in there, they'll shut you down. They'll say you're not allowed to do any queries for a while. But this is sort of a gray area right now when it comes to when you can get away with scraping and when you can't. Because, you know, the NHSN is the government. If it wasn't, it would be like, okay, the NHSN did all this work to put this data together, and now we're scraping and stealing it. So that's the whole idea. All right. So let's try to think of some structured data that might be displayed on the internet that could help you with your topic. And Mika, we can talk about this, because there is a lot of data on the internet about drug policy. For example, let's see if there's any sort of database of, so, one of the things we had been talking about is fraud. And I was thinking, you know, there are actual fraud cases that happen, where places are adjudicated to have committed fraud. Right? So maybe there's some data collection that can happen about fraud cases, like pharmacy fraud cases.
Just so we better understand what is actually happening out there in pharmacy fraud. So let's see here: pharmacy fraud cases in the United States. Let's see here. So here, HHS sometimes will have, let's see. Yeah, see this, fraud enforcement actions. I remember this from NIH. NIH had, or it still has, these notices of scientific misconduct. A "NOT," it's a specific type of notice. And so if you want to study scientific misconduct, you usually collect all those notices together. So let me see here. This is enforcement actions. Okay. Yeah, so here are our enforcement actions. Let's look here. I'm just curious about grant fraud self-disclosures. Okay, well, that's interesting. These are self-disclosures. You know, child support here. I don't even know. So this is another thing. Where's the information about what this even means, right? Where am I going to look? Is it this, stipulated penalties? Yeah, here, like prosthetics. Pharmacy might even be in here. And there's no way to search. Let's search here, "pharm." "Medicare tools to safeguard against pharmacy..." This isn't even on their server. Is this on their server? Oh, I guess. "Apply to Part D." What does this say? "Key Medicare tools to safeguard against pharmacy fraud and inappropriate billing do not apply to Part D." Does that make any sense to you? That sentence doesn't make sense to me. I remember when I was living in Tampa, Mika, there was a news article that said there was a ruling just passed by a judge saying that houses inspected and approved by the city of Tampa after they're sold, or as part of the sale, that doesn't mean they're approved. That kind of reads like this, right?
Um, so: "Medicare Part D paid $168 billion for drugs for 46.8 million Medicare beneficiaries in 2018. Despite its size, Part D does not have the same protections against pharmacy fraud that other parts of Medicare have. OIG has a longstanding concern about pharmacy-related fraud and inappropriate billing in Part D. This issue brief is another step in OIG's larger strategy to fight this fraud." Oh, so basically what this article is about is the gaps in the law where fraud occurs. So you can look for those gaps. Interesting. So I guess the answer is, the reason why I didn't find any pharmacy cases here, or maybe we could find some if we look, is that it's really hard to do, because there are just holes in the law. But let me see if there's a better database here quick, because sometimes... I'm going to show you this totally unrelated database, just because it's a great example. Okay, so I've been following this database for like 20 years now, since Iraq was invaded by the US. So Iraq Body Count was started by these people who started to notice, if you remember the beginning of the invasion, and I remember it because I was very against it, that the press was not really allowed there. So we knew what was going on with our own military, but we really didn't know how many people were getting killed. And so this group of people started this project where they would review the news and try to figure out how many people died, from news reports, from public reports. And this is super sophisticated now, right? And the problem is, and this has been written about extensively now, can you do this kind of citizen data collection? Can you really get data from reports? Can you really do this kind of epidemiology from media? And the answer is pretty much, it's kind of better than nothing. That's about it. It's not really great, but it's better than nothing.
What I feel like has happened, and I've seen this more recently, like I said this is a 20-year-old project, is that I've seen more of these kinds of things popping up, because it's just easier to make a Google Sheet, like an Excel sheet, and share it, and people can record on it. People have started keeping databases of things like that. I can't remember the exact news article, but there were some women at a workplace who were keeping a database of the guys who sexually harassed them, you know, and were all adding data to it. And so I was thinking maybe there are people, like citizen data scientists, who are collecting data about pharmacy fraud, or about fraud out there, and you might find something like that. But again, that's kind of in this category of make your own data to some degree, or maybe more like hustling around and looking. You don't even know what you're going to find. Like, I found this one: a Tesla deaths database. There's this Tesla deaths database here. Can you believe it? Isn't that fun? Oh, wow. So that's what I'm saying. So okay, Mika, it's awesome because you found some data that I think is going to work for you. But let's say that you didn't. Let's say that you found a database like this, but instead of Tesla it was overdose deaths or something like that. If the data is so clean and so easy, maybe you can just use it for a quick project that relates to what you're doing. Because this is about going down the rabbit hole, and this is why it's good to keep your requirements in mind. Sometimes you might just say, hey, I found this data set. Like, let's say I found a data set that was about pharmacies running out of insulin. Okay, that's good. I'll analyze my data set of pharmacies running out of insulin.
You know what I mean? And so you kind of have to get pragmatic about it. You don't have to make your own, because you found really good data. I say that, but there's always the possibility that you download the data and it doesn't work. So what do I mean by that? Let's say you come up with a plan: you're going to analyze some California data, whatever. And then you download it and you look. Let's say you're looking for some drugs that you know people take in California, like hypertension drugs, drugs that are really popular. And they're not in there. And you see a lot of weird drugs in there, like a lot of vancomycin or Valium, old drugs that people don't use. Then you just go, okay, I'm going to back away from this data set. Like, I remember, so, Florida has an agency called AHCA, the Agency for Health Care... I don't know exactly what it is. And they collect a lot of data, claims data. An associate of mine wrote a protocol, and he got some of that data. The data he pulled happened to be claims for surgeries, because he did surgeries. And there was a complications flag on the claim. It would be a one or a zero. And he insisted that this was such a good data point. I said, on claims, I don't think I would trust a complications flag for surgical complications. And he's like, no, trust it. So what I did was, I worked with him, and I'm like, okay, what would a claim look like if you had a complication? It was some procedure. And so we found some claims where we were sure there were complications, and that flag wasn't on them. And I was like, okay, you cannot use this flag, right? And so you've just got to be careful. It's AHCA, and it's good data, and it's good claims data, but why would you trust a complication flag in claims data anyway? You've just got to be careful even if you get good data.
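The check described above, finding claims where a complication clearly happened and seeing whether the flag is set, can be sketched like this. It is a minimal illustration under stated assumptions: the claim records, the field names (`procedure_codes`, `complication_flag`), and the codes themselves are all hypothetical, not the actual Florida data or real billing codes.

```python
# Hypothetical claims. Each has a set of billed codes and the 0/1 flag.
claims = [
    {"claim_id": 1, "procedure_codes": {"P100"}, "complication_flag": 0},
    {"claim_id": 2, "procedure_codes": {"P100", "C900"}, "complication_flag": 1},
    {"claim_id": 3, "procedure_codes": {"P100", "C901"}, "complication_flag": 0},
    {"claim_id": 4, "procedure_codes": {"P200"}, "complication_flag": 0},
    {"claim_id": 5, "procedure_codes": {"P200", "C900"}, "complication_flag": 0},
]

# Hypothetical codes that a clinician says only appear when the surgery
# had a complication (for example, a follow-up procedure to fix something).
COMPLICATION_CODES = {"C900", "C901"}


def check_flag(claims, complication_codes):
    """Return how many claims clearly had a complication, and which of
    those are missing the complication flag."""
    known = [c for c in claims if c["procedure_codes"] & complication_codes]
    missed = [c["claim_id"] for c in known if c["complication_flag"] == 0]
    return len(known), missed


n_known, missed = check_flag(claims, COMPLICATION_CODES)
print(f"{len(missed)} of {n_known} known complications have no flag: {missed}")
```

If even a handful of known complications come back unflagged, that is a good reason not to build an analysis on the flag, no matter how reputable the source of the data is.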
So, let's see here, I was getting towards the end. So now we're at the end of our workshop, and I'm so happy Mika has data. So just in conclusion, there are three main ways to find data, you know, in 2023. I guess if I do this in a year, they'll be different. So first, you can shop for data and look in official data repositories. I would put the Kaggle data that people find in that first category, which is, remember, not that useful. It's useful if you want to practice displaying it on a dashboard or practice building reports. But if you really want to find something out from the data, often repositories have data that everybody's already trampled all over. So second, what you find yourself doing is hustling for data, looking for data sets available on the internet that are not part of a repository or, like in Mika's case, that are technically part of a government agency's repository, but they're buried. And so what was smart about what Mika did was that she knew the government data existed and she knew it should be available, and she just kept looking until she found it. I've had that experience too, where I'll just look and look and look and then finally find it. So if you have a job, or you've had a job, where you know about certain data and you know it's on the internet and you just have to find it, then keep looking. But if you don't, you still have to keep looking, because you just don't know what's out there, and every day people post stuff. We had a workshop on GitHub, and I'm sure people post data on GitHub. The problem is GitHub is not an official data repository, so it doesn't force you to put documentation and all that on there. And so who knows what you're using. In the end, third, I like to make my own data, and you can do that.
But you have to be careful. If you're going to scrape it, there are ethical considerations and legal considerations. And if you're going to just collect data off of the internet, that's abstraction. So you really want to be careful that you can replicate it and that you have a good method, so that you can analyze it and interpret it when you're done. So we'll end the workshop, and I'll post this for Beth and Sakeed, and hopefully then they can go shopping for data just like we did. So good job with your data detection. You're a great data detective. Like Encyclopedia Brown, you got your data. So great job. All right, well, thanks for showing up today, and I hope you have a good weekend. Thank you for watching this video, which is part of the Public Health to Data Science rebrand program. If you are interested in joining the program, please sign up for a 30 minute Zoom interview using the link in the description.