Hi everyone, welcome. I think everyone is in from the waiting room. Thank you to those of you who arrived early. We will get started in about two or three minutes, but while we're waiting for folks to log in, find that link, click on it, and make sure everything's set up with their Zoom, I just want to go over a little bit of our interaction ground rules for today. You are all muted upon entry, and if you are not speaking, I invite you to stay on mute, just because we have kids and dogs and helicopters and trucks and all that good stuff out in the world. It would be helpful for me as your speaker if you stay on mute when you're not speaking. That being said, I think we're a small enough group (oh, and I see some folks that I know, that's nice), a small enough group that I think we can be interactive. So I invite you to come off mute and interject, ask a question. Feel free to say, "Hey, Joy, hang on just a second before you move forward, I'm having trouble understanding that step." Please don't be shy, because if you ask a question, you could be helping someone else. I also have the chat window open on another monitor, so I will try to look over and keep an eye on the chat, but I can't promise to be super spry with that. This session is being recorded, so if you prefer not to be on the recording, please stay muted in both audio and video; that way you can preserve your anonymity, and I'll be happy to read questions aloud from chat. So with that, let's go ahead and jump in and get started. I'll do some of my intro slides so that folks who come in late won't miss the bulk of the really important stuff. So: you're here today, and I hope you're in the right session. This is Using Public Data and Maps for Powerful Data Visualizations, and I am your host and your speaker today, Joy Payton. Let's see if I can get my slide deck doing what I want it to do. Perfect. So, about this session.
First of all, I have some preconceived notions that I want to share with you, because these preconceived notions are going to give shape to the way I talk about these topics. If you share them, this is going to be a really great session. And if you have some disagreement, just be aware that we'd like to have your opinion as well, and I invite you to say, hey, I have a little bit of a different opinion about that. One of my preconceived notions is that geospatial data is useful to clinicians and biomedical researchers, folks like you here at R/Medicine. I think geospatial data is important to you because, whether we're talking about patients or human subjects... when we see children in... excuse me, I'm from a children's hospital, so I always think of children... when we see human subjects or patients in the lab or in our clinic, that's only a tiny sliver of their lived experience. So understanding where they live, work, go to school, travel, etc. could be useful to you as a clinician or a researcher. Along with that, neighborhood data is also useful to you. You may not know the income of your patient's family, but you can know the income of the neighborhood they live in, which can give you a proxy idea of some of the gestalt that your patient or your research subject finds themselves in. I believe that maps are a great way to share findings with all kinds of stakeholders, including policymakers; I'm going to speak about maps as an idiom and why they're so powerful. And most importantly, I believe you can do a geospatial project just in R, without some of those expensive tools. So this is the overview of our session today; this is our itinerary, and we're going to interlace some breaks. We're going to start by learning a little bit about APIs. Some of you work with APIs frequently; some of you have heard that acronym and aren't sure what it means. But we will talk about...
...Socrata. I have to remind myself, like frittata: Socrata. Socrata public data portals. Socrata is a company that does a lot of government and other public data portals. Then we'll work on maps and geospatial data. Section four is a little US-centric, so apologies to those of you who do not have an interest in US data, but the information may still be useful to you. And then we may have a fifth section if time permits. Importantly, this slide deck is available for you on RPubs, and let me just copy this and drop it into chat so that you have it. All right, so this is the RPubs link. Thank you so much, Beth. So this is the presentation. Raymond asks which R Markdown holds these slides. The R Markdown that holds these slides is in a separate repository from the repository that was shared with you for the materials. That being said, if you're interested in cloning this presentation, all you have to do is put "-presentation" on the URL of that GitHub repository. So there's a GitHub repository that's r-medicine-2022; if you make that r-medicine-2022-presentation, that will take you to these presentation slides. But you should have everything you need at the RPubs URL that I just posted into chat. Briefly, about me. Who am I? I should have said what my affiliation is: I am a data educator at the Children's Hospital of Philadelphia. I have no conflicts of interest to surface; however, I like to joke that I'm always looking for conflicts of interest, so if you have an interesting one, look me up. I'm a data scientist and I'm a data educator, so I know a little about a lot. I am a big advocate for research responsibility and rigor, because I work principally with researchers, some of whom are clinicians and some of whom are not. I'm a big census nerd; my first job after undergraduate was actually as a geocoder, trying to figure out where in actual space certain addresses were located.
I'm a big fan of maps and I'm a political junkie, which ties into maps and census and a lot of other things. What am I not? I am not a GIS wizard: you are not going to hear about Esri products or ArcGIS or other really heavy-duty GIS topics in this session. I am not a demographer, I'm not a physician, and I'm not a statistician. So please hold me accountable only for the expertise I profess, and be gentle with the skills that I don't have. To reach me, I'm available on LinkedIn. You can also make an issue on the repo for, I say this presentation, but what I really mean is this workshop. And you can use my CHOP email address. This document is on the web, so that's why I have "payton k at chop dot edu" written out that way, just to slow down the email harvesters, but I will be happy to hear from you. So with that, I think we've got a full house. Oh wow, a lot more people than I expected; this is fantastic. We're going to start with section one, which is APIs. What are APIs, and why are they important? That's what we're going to talk about in this section. You're going to be exposed to several different APIs throughout the course of this workshop, so an introduction to what the heck these things are is in order. But before we get started, I think the most important thing for you to do is to clone the materials for the workshop to your own computer. So what I will do is actually, let me grab a new tab, and I am going to go to GitHub. See here. All right, let me grab that URL first of all and put it in chat so that everyone can have it. And then let me pull this tab here. All right, so this is r-medicine-2022. If you did want the presentation slides, you can just do "-presentation", and that'll be the presentation slides, but you really don't need that to get the full effect. If you go to r-medicine-2022 at that URL, this is a GitHub repository. And if you are unfamiliar with GitHub, don't worry; there are a couple of solutions for you.
What I'd like for you to do is click the green Code button. In this slide you can see where I click the Code button and a couple of things open up. If you use Git regularly, you know how to clone a repo to your own computer, and I invite you to do that, either with GitHub Desktop if that's your tool of choice, or by using the link to set up a clone in your own environment. If you don't know what any of the heck that is, just click on Download ZIP, and you can get a zipped-up version of these files. So I'll give everyone a chance to do that. And if you do not... let's say you don't have R or RStudio installed on the computer you're on, and you want to work with these files and participate in the workshop, but you don't have the tools; or we get to a place in the workshop where you have a dependency that you can't install, and your computer is governed by your workplace or your university, and you cannot install what you need to install, and you're frustrated and shaking your fist: have no fear. Oops, let me get rid of that. There is an RStudio.cloud version of this project. If you have an RStudio.cloud account, or you want to set one up, it is free of charge at the entry level. So there is a version of these files there, and you can make a copy of this project and save it in your own RStudio.cloud account. So I'm just going to give folks one more minute to go ahead and clone these materials; I know you did get an email, but we're all very busy. And once you've done that, you need to create another folder, and this folder will be called "private". And this folder... oops, I'm being joined by my cat, let's see if I can get her off my lap; she's a big fan of R. The private folder will hold at least one API key; depending on your uses and how far you want to experiment, you may store other API keys in that folder.
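As a sketch of what that setup might look like from the R console (the key filename here is a placeholder for illustration, not something the workshop materials require):

```r
# Create the private/ folder at the top level of the cloned project;
# showWarnings = FALSE makes this a no-op if it already exists
dir.create("private", showWarnings = FALSE)

# Later, a script can read an API key from a file in that folder.
# "my_api_key.txt" is a hypothetical filename -- use whatever name
# your script expects.
# key <- readLines("private/my_api_key.txt")
```

One common reason for a dedicated folder like this is that it can be listed in .gitignore, so keys never end up committed to a public repository.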
So once you get these files downloaded and possibly unzipped, I would like for you to add a folder called "private", and I will explain what to do with it when the time comes. Right, I'm looking at the chat and I don't see any signs of alarm, so let's go ahead and move on to what an API is. What you have here is a screenshot from one of the APIs provided by the New York Times; the New York Times has many APIs, and this one is for the best sellers, their Books API. If you're a big reader, you've probably heard of the New York Times Best Seller list. There are some instructions here, and this is the very tippy-top of a very long document that explains how to use their API and how to shape the URL, the web address, to get the specific information you are interested in. A lot of different websites, from PubMed to Twitter to the New York Times and many, many others, will have a page like this that explains how to use their API. So, API stands for application programming interface. And what it is, is a way for people or computers, really an application, to interact in a prescribed, formulaic way with a computer that's listening for these requests, right? There are a couple of different flavors of APIs out there, and the most popular one, the one you will pretty much universally use at this point, is what's called a REST API. This is essentially a resource-oriented API: you use the URL, the web address, the thing you type into your web browser, to describe in detail the precise resource that you want to access. For our purposes today, we're going to just talk about downloading; however, there are APIs that allow you to upload data as well. Today we're just going to be talking about accessing public data and pulling that data down into your R scripts. So why would you use an API? Why not just Google the New York Times best sellers, right?
They print it in a web page; you can easily look up, for any given date, what the New York Times best sellers were for, you know, January 17, and you can just copy and paste it, right? So why go to the trouble of learning an API? It seems like a lot of work to figure out how to structure a URL. The thing about APIs that makes them really powerful is that they define a specific structure for input, for describing what you're looking for, and you get a reliable output that is always the same. Compare that to a human-driven, point-and-click mechanism: if you ask two people to go find the New York Times best sellers for a specific date, they'll do it in two different ways, they'll click on different things, and they might potentially get two different answers, depending on whether they got everything right and did all the steps correctly. What's also really important is that APIs will give you fresh data. An example of this: let's say you're doing a lit review. It's happened to me, and I'm sure it's happened to you: you're doing a lit review for a publication, and then two months later the paper still hasn't moved, and you've got to redo your lit review and see what's happened since you last did it. It sure would be nice, instead of going to Google Scholar or going to PubMed and using the search box, to automate this so that you can ask: what were the journal articles that came out in 2018, 2019, 2020, etc. for these topics? Well, PubMed has a nice API. They call it Entrez; I'm not sure how they pronounce it. And this screenshot is from their API documentation.
If you've ever used the REDCap research capture software, you may have used their API, which is a great way to make sure you're always getting the very freshest data: instead of always downloading CSVs and making sure you change the name because it has the date in it, you make a fresh API call within your R script every single time, so that whenever you run that script, you're getting the most recent data. That's another benefit of using APIs. And most of all, the alternative to using APIs is pretty bad, right? Think about writing a documentation section for someone who works in your lab, or a student worker, or something like that, to get data. That sort of human-executed punch list is very error-prone, it's tedious to record, and it is prone to change: maybe the checkbox that was on the left is now on the right, and you have to get a new screenshot to describe how to do this thing in a manual way. So all of these are the benefits of using APIs. Now let's talk about RESTful or REST APIs and how they use URLs. You have almost certainly seen URLs like this one. This is a real URL from when I was searching for a particular text on Amazon. You can see that after the Amazon web address there's a question mark, and then a bunch of characters that seem to indicate something; you can see "r for data science" in there, so this is clearly some sort of book search. Let's break this down and look at these various values. There are key-value pairs within this query string. I know the query string itself may be sort of small on the slide, but I broke down in the bulleted list what the different keys are and their corresponding values. So there's k, which equals r+for+data+science; crid, which is equal to an alphanumeric string; there's sprefix, which is equal to r+for+data+sci, etc. And then there's ref.
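To make that anatomy concrete, here is a small base-R sketch that assembles a query string from key-value pairs. The keys and values are simplified stand-ins for the Amazon ones above:

```r
base_url <- "https://www.amazon.com/s"

# Named vector of key-value pairs to go after the "?"
params <- c(k = "r for data science", ref = "nb_sb_noss")

# URLencode() percent-encodes spaces and other reserved characters;
# each pair is key=value, and pairs are joined with "&"
encoded  <- vapply(params, URLencode, character(1), reserved = TRUE)
query    <- paste0(names(params), "=", encoded, collapse = "&")
full_url <- paste0(base_url, "?", query)

full_url
# "https://www.amazon.com/s?k=r%20for%20data%20science&ref=nb_sb_noss"
```

Note that spaces become %20 here rather than the + you see in the real Amazon URL; both are valid encodings of a space in a query string.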
So we have four key-value pairs. That's true of all query strings; it's a feature of the HTTP protocol that a query string will start with a question mark and then be followed by key-value pairs in the format: the key name, an equals sign, and the value. We can't have spaces in a URL, and that's why you'll see things like pluses, or maybe a %20, to indicate spaces. And to join various key-value pairs together, we use an ampersand; that's how we can string various things together. Alright, I'm going to pause there and see if there are any questions about why APIs are useful compared to point-and-click computing: the point of APIs, and using URLs as a way to interact with an API. We're going to do some concrete hands-on work starting in our next slide. Okay, great. So in the materials that you downloaded for this course from that GitHub repo, there is a folder called scripts, and inside scripts there's a PubMed API example .Rmd file. This is a great way to start, because PubMed, like some APIs, has a sort of free, anonymous tier, where you don't have to have any API key or credential. You can just be an anonymous person and pass them a link and get information back from their API. It is limited to only a certain amount of data, and I've seen weird things happen: if you're on the same IP zone as a bunch of other people on this call, it could be possible that PubMed says, whoa, whoa, whoa, there are a lot of people hitting the API simultaneously. So we might run into some things, depending on how many folks at your institution are trying to do this simultaneously. But I think this is a great way for us to get started, because a lot of us have a professional interest in PubMed, so this could just be useful. If you can't or don't want to actually run this code, it is also available on RPubs at this URL, which I will copy and paste. I invite you to go ahead and open that.
Alright, and I'll go into my RStudio.cloud, go into scripts, and go into the PubMed example as well. If you're using RStudio.cloud, you will get a number of "hey, you need to install this" prompts; please install, it should install no problem. And we can just run through this. The first R Markdown of the workshop has a little blurb on R Markdown; I'm not going to bother reading that to you, I leave it as an exercise for the reader. We load some very industry-standard packages: tidyverse, which everyone knows and loves; easyPubMed, which is going to make the construction of that query string a little bit smoother (you don't have to actually construct the query string, this library will do it for you); and printr, because I like the way it makes things look. Alright. Now, I say 2020 in here but then I use 2015, so I'll fix that in post. Let's just leave it as 2015. Please don't bother correcting this in your version; this is just me being persnickety. So what I'm going to do is pass this search string. This search string is the way that PubMed specifically likes its data to be structured; different APIs will have different rules of engagement, and this is the way PubMed specifically likes it to be done. So I have this string, which has the date, and in brackets I say: that's the date of publication, please. And then I have two keywords, medicine and disparity, and I would like to get the results for that. So let me run that. This gives me a bunch of information: it tells me how many PubMed IDs were found that meet those criteria. And it will return 20 of them at a time, so I can paginate through the results; we're not going to do that now, but just so you know what these are. And then these are the actual IDs of the first 20 PubMed publications that it returns.
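If you want to see the shape of that call outside the .Rmd, a minimal version (assuming easyPubMed's `get_pubmed_ids()` behaves as described above; the variable names are mine) looks roughly like this:

```r
library(easyPubMed)

# [PDAT] tags the year as a publication-date term;
# AND joins it with the keyword terms
my_query <- "2015[PDAT] AND medicine AND disparity"

# get_pubmed_ids() builds the Entrez URL for us and sends the request
res <- get_pubmed_ids(my_query)

res$Count   # total number of matching articles
res$IdList  # the IDs of the first page of results (20 at a time)
```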
Alright, and then some more metadata. If I wanted, I could extract just a count of the articles and say: alright, 600, perfect. But let's say I'm doing a lit review, and what I really want to look at is the evolution of the understanding of disparities in medicine over the past few years. I won't go into the deep how-to; feel free to read this. Basically, I'm going to set some years, 2012 to 2022, so stopping just before 2023; three sets of terms, medicine disparity, medicine racism, medicine racial bias; and then I'm going to combine those combinatorially to get search terms. If I look at the search terms, you can see it goes 2012 with terms 1, 2, 3; 2013 with terms 1, 2, 3; 2014, and so on. So it combines those two lists. That's what it looks like. And now I'm just going to put it in that format that PubMed likes. Let's take a peek at it. If I move over here, I can see the glued-together query that I'm going to pass. I'm going to make a function that just grabs the count, because I'm really just interested in the sheer number of articles that I find. Then I'm going to loop through each of those terms, and because this is the anonymous API, I'm going to sleep for a half second between each one. That's polite, and it will keep the API from kicking you out for being too greedy. This is something to know if you work with APIs frequently. I've done this before, where I'm making some small changes to my R script and then I rerun it and rerun it and rerun it, and then I get my hand slapped: hey, you're done for the day, you need to step away, you've gone over your limits. So that's just something to be aware of: being polite and not hammering a server with a bunch of requests. We're just going to do this slowly, so this will stay green for a while as it's running. But I'll go ahead and run the next cell, which is going to be the visualization.
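The loop described here, with the polite half-second pause, might be sketched like this. The helper structure and column names are mine, not necessarily what the workshop .Rmd uses:

```r
library(easyPubMed)
library(ggplot2)

years <- 2012:2022
terms <- c("medicine AND disparity",
           "medicine AND racism",
           "medicine AND racial bias")

# Every year/term combination, one row each
searches <- expand.grid(year = years, term = terms,
                        stringsAsFactors = FALSE)
searches$query <- paste0(searches$year, "[PDAT] AND ", searches$term)

# Loop slowly: sleep half a second between calls so the anonymous
# API doesn't decide we're being greedy and rate-limit us
searches$count <- vapply(searches$query, function(q) {
  Sys.sleep(0.5)
  as.numeric(get_pubmed_ids(q)$Count)
}, numeric(1))

# One line per search-term set, year on the x axis
ggplot(searches, aes(x = year, y = count, color = term)) +
  geom_line() +
  labs(y = "Articles", title = "PubMed articles per year by topic")
```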
And it went ahead and visualized it. So I've got this nice ggplot of the trend in medicine disparity, medicine racial bias, and medicine racism, by number of articles. 2022 is obviously not complete, but it seems on course to continue the upward trend. And it's interesting to see that there does appear to be a spike in 2020, around the uprising after the murder of George Floyd; that seems to be reflected in the trend here. So this is just an example of how you can use an API, and specifically how you can use an API in something that might actually be super useful to you, which is doing a reproducible, automatable lit review. We've all had to do that thing where we describe how we looked for articles and what search terms we used; this could be a way to do that with, potentially, more ease. I'm going to pause here, because on our next slide we're going to be starting section two, and after section two we will take a break. But I just want to pause and see if there are any questions about that PubMed example, or APIs, or other things that I've talked about. Again, if you came in a little bit later: you may take yourself off mute and jump in, or you can put something in chat and I'll be happy to read it aloud. Okay, please go ahead, you have a question. I should say I also don't have the panel pulled up where I can see people's raised hands, so I'm not trying to ignore you; I just don't have that in front of me. How do you find the possible search terms, and how could you download abstracts? Alright, this is a really good point, and we can abstract it to say: for any given API, how do I use it? We're talking about two questions here. One is, how do I use the API; in other words, how does PubMed expose its data for use? And the other question is, well, some people wrote an R package to interact with that API.
And those people may or may not be staff at PubMed; they may be just average people like you and me who said, you know what, I do this all the time, I'm going to make a package to make it easier for you. So let's separate out those two questions. One is, how do I interact with a given API? I like to say that what I'm really good at is knowing how to Google things. So I am going to Google... let's see, what am I going to Google here... NIH... no, PubMed. "PubMed API". All right, and I'm just going to pull that right back in here. You can see this is purple because I've definitely used it before. So I'm going to click here. This is just me doing a first glance at how I can do this. The first thing that I found is the E-utilities, the Entrez (I'm not sure how they pronounce it) system that allows you to interact with PubMed. And they have a pretty good documentation section. If you go there, there's a whole manual; there's an introduction you can look at... oh, that's a video, I don't want to do that, sorry. There's also this quick start, and you can see this has been around for over 10 years. And you can see: here's how to search, here's a basic search, storing search results, etc. So you might want to say, instead of Googling that, maybe I would Google "PubMed API abstract". All right, let's pull that: text mining web APIs. So literally I would just start looking: API for PubMed Central open access and BioC format, API for PubMed, etc. I would just look around in here and figure out how to get the abstract; I don't know the answer right off the cuff, but if you just use the Google machine, that can help. Now the other question is, how can I use easyPubMed more easily? Thank you, Jonathan, who suggests searching for easyPubMed and R to get a nice help document. All right.
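For the abstract question specifically, my reading of the easyPubMed help is that a two-step pattern like the following should work; treat it as a sketch to verify against the package documentation:

```r
library(easyPubMed)

# Step 1: get the IDs matching a query
ids <- get_pubmed_ids("2020[PDAT] AND medicine AND disparity")

# Step 2: fetch the records themselves; format = "abstract"
# asks for plain-text abstracts rather than full XML
abstracts <- fetch_pubmed_data(ids, format = "abstract")
head(abstracts)
```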
So one thing I'm going to do is type easyPubMed with two question marks: ??easyPubMed. Two question marks is usually how you search for help on a whole package, whereas one question mark is how you search for a single function. And it looks like... I'm not sure who this Damiano Fantini is. They do not have an NIH or PubMed or National Library of Medicine email address; it's their personal address. So I'm guessing this is a scientist, somebody like you and me, who wanted to put this together. And there are some examples here of how to do that, and look: format equals "abstract". So it looks like maybe even this first example might be helpful to someone trying to get the abstract. Good questions, everyone. All right, with that, let me find my presentation, and let's go to our next topic, which is public data portals. By "public data portal" I'm really referring to a specific kind of place where data is stored: a place that is designed for data download, that is really intended for that use. Not all data that you might be interested in is stored in a portal; you may have to scrape an HTML table from some government website, or look at someone's PDF and get the data that way. I'm not really talking about that in today's workshop. What we're really talking about are public data portals, and these public data portals have APIs. So how do you find a public data portal? There's been a lot of movement in the open data world, both for scientists and for FOIA (freedom of information) advocates and citizen activists who want access to data. The easiest way to find public data is just to search for whatever your term is, along with "open data". So I searched here for "open data gun violence New York City". These are a few of the results that show up for me; let's take a look at three of them. So, the first site that I found:
And this is me Googling as myself, so who knows if this is somehow shaped by my previous history of searches, but here is my first result. When I click on it, it has a lot of statistics. It turns out it's got a lot of PDFs that I could copy and paste a table from, but that's not really the kind of raw data that I'm interested in for research. I'm really interested in doing some sort of work looking at, let's say, whether kids who live in a neighborhood with high amounts of gun violence get less exercise; maybe I'm doing some sort of exercise intervention, or some analysis of kids and physical activity based on their neighborhood characteristics. The second website is much better, and if we look at it, its address begins data.cityofnewyork.us. So this has a lot more going for it: it has "data" in the URL and it ends in .us, so it seems like this might be an interesting government website. When I searched for New York City gun violence data, it took me to this; this was the second result, the one I said, oh, this is a good one. However, this was not actually a really great data set, but I want to show it to you anyway for the things that we see here. One thing you'll notice is that NYC Open Data, and a lot of the open data portals, have this feature: there are community interactions available. As a community member you can say, I have looked at this data and I've created an artifact, either a data set or a visualization, and I want to share it with other users of this data. That's what this is: there's that community tag that I've marked with the orange arrow. And this community artifact is a view; if you look at the top orange arrow, it is a view based on a different data set. So the moral of the story is: even the best data portals are going to have data that is not useful, at least not useful to you.
But if we look at the data set that the original teeny-tiny 11-row data set referred to, it's called NYPD Complaint Data Current (Year To Date). That's actually a misnomer: it was named in 2018 or 2019, I think, but it contains data up to the present, and it is currently updated quarterly. You can see there's some really great data set information in the "About this Dataset" section. This is another marker of a great portal: you can figure out who is responsible for this data, what agency collected it, how often they update it, and how big it is, how many rows there are. In this case there are 257,000 rows and 36 columns, and the columns are listed below under the columns in this data set. I know this is kind of small, but I just wanted to give you the overview of the whole website so you can take a look at it. Now let me go back and say: look beside the title of the data set. You can see there's a sort of toolbar of buttons: a dark blue button that says View Data, and next to that, buttons that say Visualize, then Export, then API. Not surprisingly, we're interested in API. So again, this company, Socrata, is known for doing these government data portals, and they have a common API that is the same across all these different portals. So whether you're looking at a portal in Germany or a portal in New Jersey, the same API rules come into play, because it is all the same company. In our next practical exercise, we're going to look at some interesting data available through SODA, this open data API. I'll actually have you download some of this data, and we're going to create a really quick and easy map; we're not going to explain it much, but I want to give you an easy win and make sure that you get that map accomplished.
There's some API documentation, there's a developer portal, and there's an API endpoint, and you can see the API endpoint there says JSON: JavaScript Object Notation. That is generally not what you would typically use in R; you would probably look for a CSV if you were looking for tabular data. And since we're talking about geographic data here, you may also be interested in looking for GeoJSON. So, CSV or GeoJSON; we'll get into these various file types later on. All right, so we're going to get started specifically with these data portals. Now, not every public data portal uses the SODA API. There are a lot of companies in this vertical, in this space, and my own city's portal, OpenDataPhilly, does not use this API; it is on a different platform. I imagine there are a few companies in this space, and they have different packages; some of them are expensive and some of them are less expensive, etc. We'll talk about how to work with arbitrary APIs, but this API is so ubiquitous that I think it merits a little more attention. So, what we're going to do next is take a look at the various data portals that are available; let me just type this into chat: dev.socrata.com/data. Take a moment and look at that, and search for some data that you might find interesting. You might want to search by geography, like what is available for Iowa, or what is available for Venezuela, or whatever area of the world you are in. Or, if you're interested in a condition, a health care condition or a social condition, or a habit like smoking, or a public health topic, or the prevalence of an infectious disease: lots of different things to look for. You should get a list of results where each one looks something like the one I show on this slide. I searched for, I think, child abuse.
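Because the endpoint's file extension picks the output format, pulling a Socrata data set into R can be as simple as pointing read_csv() at the .csv version of the endpoint. The dataset ID below ("abcd-1234") is a placeholder; copy the real endpoint from the data set's API button:

```r
library(tidyverse)

# Swap the .json extension on the endpoint for .csv to get
# tabular output that read_csv() understands directly
url <- "https://data.cityofnewyork.us/resource/abcd-1234.csv"

complaints <- read_csv(url)
glimpse(complaints)
```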
And I found this Iowa child abuse occurrences data set. There were a bunch of other data sets, but I said, you know, I want to focus on this one. This is a very likely data set — and importantly, when I say the word likely, what I want you to think about is: does this data set have geographic features? Does this data include something that could be mapped? Like points — points where a shooting happened, or points of hepatitis A incidents at restaurants. It could be point data, or it could be what we call polygon or shape data — like what the prevalence is for various infectious diseases in the counties of North Dakota. When you find a likely candidate data set, click on the green API button, and that will give you more detail about it. When I clicked on the API button for my Iowa child abuse by county data set, I got a big long page — I only chose a tiny bit to screenshot here — that describes information about this data set. And importantly, you'll get some information about the data set, but you'll also get information about how to work with the API, because every API is a little different. This API does not work the same way as the New York Times or PubMed APIs; it has its own flavor. There's some information here about filters, as well as SoQL queries — SoQL, the Socrata Query Language, which is intended to be like SQL — so you could use those, and there are links to that. So that's a really great feature of searching for data that Socrata provides: it'll give you information about the data, as well as how to interact with the API. So if you have a data set that you think you want to work with, because it's interesting to you and it's apropos to your interests, that's fantastic. And if you don't, that's okay, because we're going to use one from New York City. So, onward — let's go.
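To give a flavor of SoQL (the dataset identifier and the column names `year` and `$where` value below are hypothetical — check your data set's API docs page for the real ones), filters and SoQL clauses are passed as URL parameters on the endpoint:

```r
# SoQL filtering sketch: clauses are appended to the endpoint URL.
# The dataset id (xxxx-xxxx) and the column name `year` are placeholders;
# consult the API docs page for your chosen data set.
base  <- "https://mydata.iowa.gov/resource/xxxx-xxxx.csv"
query <- "?$where=year >= 2019&$limit=5000"

# URLencode handles the spaces inside the $where clause
abuse <- read.csv(URLencode(paste0(base, query)), stringsAsFactors = FALSE)
```

The `$limit` clause matters because SODA endpoints cap the number of rows returned per request by default.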
All right, if you do not yet have a data set that you want to work with, we're going to use one — and you don't have to copy this down; it's going to be in the R code that we're going to use, so I will not post that long URL in chat. Actually, is that true? I'm not certain that that's true. Hang on. Let's do this. This is the web page for that data set, and again, you can take a look — let me copy this and put it into chat, in case you have not found a data set that you want to use. Interestingly, the data set for Milwaukee actually lists Austin, Texas Metro in the API address. Uh oh. So, there's human error all over the place, including in data catalogs and things like that — that sounds like it is probably a mistake. That being said, in this New York open data resource, one of the things that I like to use with grad students is 311 data. In the United States, in a number of large cities, you can dial 311, and that will put you in contact with a lot of city agencies: from "hey, my garbage didn't get picked up," to "my neighbors are having a loud party," to "there's a dead bird on the sidewalk" — all these various things that can happen, you can call 311. And when you look at 311 data from New York, because you can report from anywhere, there are reports from all over. You know, 99% are reports from New York City, but somebody might be a tourist from Minnesota who downloads the 311 app while they're in New York and reports something. So, there are some weird artifacts in data as well. But again, there's lots of metadata here, which is really useful. Oh, this is the one — let's go to this one. This is the one I think we want; I apologize, let me do this. This is the one we want: the larger data set that has 257,000 rows and 36 columns, and we can see what this data is about. There's even a preview of what the data looks like.
So, you can see the date; something about the complaint, ADDR_PCT_CD — that's probably some sort of precinct code, PCT, I'm guessing; the borough name (there are five boroughs in New York City); and CMPLNT_FR_DT — I'm not sure what FR is, but DT is probably date-time. And then we've got a data dictionary up here that gives that information. So, lots of great information here. And there's this handy API button, and if we click on this file type, we can see there's JSON, but there's also CSV and, importantly, GeoJSON as well. All right, so we're going to work with that data, unless you have data that you want to try working with. And if you do, we're going to have enough time: I'm going to give you a big chunk of time so that you can riff on this, find data that you find useful, and play with it, and then we'll come back. I'll give you work time and break time, sort of combined. So let me go back to my — let's see here, let's go back. Okay. All right, so what we're going to do is use another file in your scripts directory. This one is called simple maps from Socrata — I keep saying it the wrong way; I want to say "Socrata" like Socrates, but I believe it rhymes with "Hakuna Matata": Socrata. So: simple maps from socrata dot Rmd. Please go ahead and open that and follow the instructions there. If you want to just use it as is, that is completely acceptable — you're going to look at some data that I've chosen that I think is interesting and fun to look at. But I encourage you to be a little gutsy and find some data of your own using their data catalog, and take a look at it. The rendered version is available at RPubs — again, it's that same rpubs.com address under my handle — and it's called simple maps from Socrata. So let me paste that. That is the rendered version.
So what I'm going to do is go ahead and wake up my workspace, go to my files, and open up this R Markdown. You're tired of hearing my voice, and you're probably itching to get your hands on some code, so I'm not going to walk through this. It is five minutes until the hour, so let's say 15 minutes: at 10 minutes past the hour we'll resume. I will be here — I'm going to have my camera off and my microphone muted, but I'll be here, so feel free to jump in and ask questions. In those 15 minutes, you can use all 15 for work, or five for work and 10 for break; I give it to you to govern your own time. And I will see you back here to begin the next section at 10 after the hour. ... Welcome back, everybody; it is 10 minutes after the hour, and hopefully you had a chance to go through the simple maps from socrata dot Rmd exercise. So, just really quickly, first of all, I'll solicit: did folks have any problems with installing things, with running things? I just want to solicit any overall problems that folks had. Okay — the rgdal issue. Yeah, I'm sorry that happened to you, Andrea. And this is part of why RStudio Cloud can be helpful: it has all the standard C++ libraries that are required to allow these packages that ride on top to work. So, GDAL is the Geospatial Data Abstraction Library, and rgdal runs on top of that. A lot of times you might get an error that says rgdal couldn't be installed because GDAL is not installed. For me — I have a Mac computer — I had to install the base C++ library at the beginning. So, in retrospect, I think the next time I lead this workshop, what I'd like to do is hold folks' hands a little as you do that.
So once you get the GDAL C++ library installed — and if you just search "install GDAL on Windows," let's say, it will tell you how to do that — then you can get rgdal on top of it. So, apologies about that. Oh, you've got a notice — that's interesting: rgdal will be retired by the end of 2023. That is news to me, and definitely news I can use, Beth, so thank you. I imagine there will be some other abstraction library that takes over for it. Oh, move to sf. Yeah, okay, so sf also sits on top of GDAL, so that makes sense. And when you installed a bunch of these packages, sf was probably installed behind the scenes. That's super helpful — thanks for throwing that information out for us, Beth. So let's just take a look and see what this looks like, if you were able to install rgdal and load these packages. What we're doing is bringing in some data from an endpoint here — the API endpoint that ends with geojson. And when we bring that GeoJSON in, it gives it an abstraction, such that whether we pull it from the Socrata endpoint or some other endpoint, whether it's a GeoJSON or a shapefile — an Esri shapefile — the abstraction provides us the same structure: we have data, the bounding box, maybe a projection, polygons, etc. And the nice thing about Leaflet — somebody asked in the chat, is Leaflet like ggplot? ggplot2 does have some associated mapping features; ggmap is one. And I'm a big fan of ggplot2 and, in general, the tidyverse and the works of Hadley Wickham. But when I teach mapping, I prefer to teach using Leaflet, and there are a few reasons why. Principal among them is that Leaflet is a JavaScript library that is then ported into R, or ported into Python, or whatever other language you're using for data analysis.
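As a sketch of that abstraction (the endpoint URL below is a placeholder for a real Export/API link from the portal), reading a GeoJSON endpoint with rgdal yields the same spatial structure you would get from a shapefile:

```r
library(rgdal)  # note: rgdal is slated for retirement; sf::st_read() is the successor

# Hypothetical GeoJSON endpoint -- swap in a real export link from the portal
url <- "https://data.cityofnewyork.us/resource/xxxx-xxxx.geojson"
districts <- readOGR(url)

# Whatever the source format, the abstraction looks the same:
class(districts)      # a SpatialPolygonsDataFrame (or SpatialPointsDataFrame)
districts@bbox        # the bounding box
head(districts@data)  # the attribute data frame
```

This is why the downstream mapping code doesn't care whether the data arrived as GeoJSON or as a shapefile.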
So it means you can learn it once and use it anywhere, which I really like. Depending on the language you're using, there are a few syntax differences, and I go back and forth between using Leaflet in Python and Leaflet in R all the time. So that's one benefit I like about Leaflet. The other benefit is for when you are working with stakeholders — let's say you are trying to change policymakers' opinions about a public health issue. If you want a dynamic, clickable map on your lab's website or your hospital's website, Leaflet is a fantastic way to create those interactive, web-based, JavaScript-based data visualizations very simply, because it is more or less the industry standard in JavaScript for doing mapping. You can also make static maps, and we'll do that a little bit at the end — it's just very multi-purpose, and that's why I like it. All right. So here we have a base map. A base map is essentially the streets and names and major features that just come with a street map. We get our base map by using the addTiles() line here, and then we set the view by providing our longitude, our latitude, and our zoom. So really, with just two lines of code, we can get a reasonably decent map of an area of interest. If the data that you are downloading is based in polygons, the polygons are included in that data, and Leaflet knows how to read them, so all you have to do is say addPolygons() — you don't have to be more specific about where they are. I didn't like the standard default blue, so the next chunk of code shows you how to tailor it so it's, in my view, a little more attractive. This data set itself is not particularly lovely — I think these are maybe Senate districts or something like that. All right, but what if you're dealing not with polygons but with points? So this is that shooting —
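Sketching those two steps in R (the coordinates here are just an example centered on New York City, and `districts` stands in for whatever spatial object you loaded from the portal):

```r
library(leaflet)

# Base map: tiles plus a centered view -- two lines of code
leaflet() %>%
  addTiles() %>%                               # the default OpenStreetMap base map
  setView(lng = -74.0, lat = 40.7, zoom = 10)  # center on NYC, zoom level 10

# If `districts` holds polygons, Leaflet finds the geometry itself:
leaflet(districts) %>%
  addTiles() %>%
  addPolygons()                                # default styling: that blue fill
```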
— I'm sorry, the crime database that we were looking at before. Instead of addPolygons(), we're going to use addCircleMarkers(). And again, those points are well understood by Leaflet, because the data has been abstracted to this common geographic level, so you don't have to be more specific — it knows where to find the points in the data. Again, I'm not crazy about that particular large blue circle, so in the next chunk of code I changed the way the circle markers look by changing the defaults, and I give you some options to play with there. That version is a little more readable. So that was a first blush at mapping, and what we'll do in the next section is take a closer look at maps and geospatial data. I did want to give folks a chance: did anyone work with some data and find something interesting? Were you able to map some data that was interesting to you — close to you geographically, or close to your interests? If anyone has any big wins that they want to share, I would love to invite you to put that in chat so we can celebrate with you. In the meantime, while folks are typing, we'll go ahead and present section three: maps and geospatial data. We're going to talk about several different topics in this section. This section and the next are fairly lengthy, because there's just a lot of interesting material, especially since this data is very rich, and I really want us to get into the details of mapping with Leaflet in section three. So we're going to talk about maps as an idiom, the kinds of files you might find in the wild that have geospatial data, and then mapping with Leaflet. All right. So this is a map — a reconstruction, or a compilation, of some maps that were done in the 12th century. If you have seen this map before and you know what it is of, on your honor, don't say anything.
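A sketch of tailoring those circle markers (the `shootings` object is illustrative — it stands in for the point data loaded earlier — and the particular values are just examples of standard Leaflet options):

```r
library(leaflet)

# `shootings` stands in for a SpatialPointsDataFrame loaded earlier.
# Smaller, semi-transparent markers usually read better than the defaults.
leaflet(shootings) %>%
  addTiles() %>%
  addCircleMarkers(
    radius = 3,          # much smaller than the default
    color = "#8B0000",   # dark red instead of the default blue
    stroke = FALSE,      # no border line
    fillOpacity = 0.5    # let overlapping points build up visually
  )
```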
But for those of you for whom this is your first time seeing this map: what does this map depict? First of all, how do we even know it's a map? All right, I have a "Europe?" — question mark. Okay, Canada. North America. Middle East. All right, I'll help you. The center of the map — this vertical seam that starts at the top — passes through what is current-day Saudi Arabia on its way down. So the large landmass in the center at the top is Saudi Arabia. South is on top — exactly right, south is on top. This is my opportunity to remind everyone that north is not "up": "up" and "center" are generally determined by the people making the map, the people with the political and printing power to make their data visualization win the contest of how these visualizations are created. So this is in fact a map of Europe, Asia, and northern Africa. On the far right, we see Spain, sort of in the bottom half. From Spain, moving to the left, you can see Italy and its boot kicking Sicily, and then there's Greece kind of popping out, and you can see the Mediterranean in the center bottom here — I'm sorry, this is the Mediterranean; this is the Black Sea in the center bottom, and this is the Caspian Sea. Right, and this is northern Africa in the upper right. Exactly. So this is a map that differs from modern-day maps mainly in two respects. Number one, it's geographically inaccurate — again, this is the 12th century, no satellites. And two, south and north are inverted relative to our current understanding of how maps are generally drawn. That being said, you were able to figure out what this was, or make some educated guesses, because there are some idioms that have worked for the last thousand-plus years. One is that water is blue. Growing things, like forests here, are green. There are lines that indicate roads or natural features like rivers.
There are dots that represent towns or cities or other areas of interest. There are letters, which you see at the top — this, again, is a reconstruction of a 12th-century map, so instead of Arabic numerals, there are Roman numerals — and there are sections of this map along the top and along the left. This is an understandable idiom, and I say this because I think we would be hard pressed to find another data visualization idiom that is so similar after a thousand years of use. So that's one map. And this is another map: a map from the 1950s of Okinawa, Japan. And you can see: water is blue — here we have a blue outline rather than a blue swath of color. We have lines that show roads and rivers. We have dots. We have words. We have some additional features that give us a guide to the data visualization. But if we were to go back and forth between the two maps, you would see that the changes are really fairly minor improvements. This data visualization medium is very powerful, and it's one of the ones that we learn first. I learned how to read a map in second grade — I don't know when you learned it in your school system — how to say, if I want to go from the fire station to the school, I go north one block and west two blocks, and things like that. And why am I belaboring this point? I think this is important because, as healthcare professionals, as researchers, as folks who are interested in medicine in whatever way we are, we're used to dealing with professionals who are very, very experienced with data visualization. We're submitting a paper to Science or Nature, or we're using waveforms and describing things to colleagues in our own medical specialty, etc. But sometimes we are presenting information to stakeholders who do not have the same base of knowledge that we have.
So the ability to speak to people in idioms that are clear to them is very, very powerful, and I think maps are one of the most powerful ways that we can talk about public data. So, just to recap some elements of maps here: you may be familiar, if you've studied public health, with the work of John Snow in the cholera outbreak in London. I don't include all of that here, but just keep in mind that there are some basic elements of maps to hold onto when we're creating a data visualization; we will not get to all of them in a single workshop. The important thing is that maps are easily understandable and stable over time, which is great. So, when you get a map file — there are a couple of ways you might work with public data and maps. One is, you might get public data that does not come with geometry attached. Excuse me: it does have geographic information, but it doesn't give you, say, latitude and longitude points. For example, you might get a list of counties and case numbers — simply tabular data that describes each county and its case numbers — and then a separate file, which is the shape of the various counties. And because the name of the county is in both of those data sets, it's fairly easy to combine them, in the same way you would using a join or a merge. So one way you get data is plain old tabular data, like a CSV that includes things like county names, state names, or census tract names, plus a separate map file. Another way — and this is what we're going to work with principally today — is a GeoJSON or a shapefile, or another type of file that contains both information about a subject or topic and the description of geographic features. So we're actually going to open up some of these files and see what's inside.
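As a sketch of that first pattern (the column names and values here are invented purely for illustration), joining tabular case counts to a map file's attribute table by county name might look like:

```r
library(dplyr)

# Hypothetical tabular data: one row per county
cases <- data.frame(
  county     = c("Adams", "Berks", "Bucks"),
  case_count = c(120, 340, 210)
)

# Hypothetical attribute table from a map file, with matching county names
shape_attrs <- data.frame(
  county = c("Adams", "Berks", "Bucks"),
  geoid  = c("42001", "42011", "42017")
)

# The shared county name is the key that lets us merge the two
joined <- left_join(shape_attrs, cases, by = "county")
```

The same idea works with base R's `merge()`; the only requirement is a column both tables share.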
So, there are two principal kinds of mapping files. GeoJSON is one, and I have the link to the standard — it is very dry reading, but in case you like standards, the link is there. And then there's the shapefile. The shapefile is the Esri standard, and Esri is the company that makes ArcGIS. Again, I'm not a GIS expert, and maybe because I'm not a GIS expert, I prefer GeoJSON. This is because I come from the software development world, where JSON is a very well-understood data format. So I prefer to use GeoJSON, and I would say a number of people share my belief that GeoJSON is a little more user-friendly than the shapefile — but we'll take a look; your mileage may vary. There are a few other, I would say, minor data types that have a much smaller market share; I include those for you as well, but we will not go into them today. We will look specifically at GeoJSON and shapefiles. All right, so what we will do is take a look inside shapefiles and GeoJSON, and then we'll use rgdal to transform these files. All right, let's do this. Again, I'm going to have you open, within scripts, one of these files, which is opening map files dot Rmd. This again exists on RPubs; it is this one — let me copy and paste that so you have it. But I'd like to actually go through this with you, so please go ahead and open it, either in RStudio Cloud or on your own computer: open the opening map files dot Rmd, and I will do that with you. The first thing you'll notice is that we have a number of potential installs here, and if you had a problem with rgdal, again, that's included. If you don't feel like troubleshooting that right now, in the middle of a workshop, this may be a great opportunity to come over with us to RStudio Cloud and do the work there. All right, so, I say here, you're going to have to install these packages.
You're going to have to install GDAL on your local computer — and a few things about this setup. So, the setup chunk in line 16 — actually, let me make this a little bigger; I'm sorry, I just realized this is probably small on your screen. That's probably better. The knitr setting of options in line 16 has some things that you may not be familiar with. Echo equals true you've seen all the time, even if you've never clocked it — it's always there. And message equals false: I use that a lot once I'm done with my R Markdown and I'm confident that it does what I want it to do, because it turns off all the informational bits and bobs that I don't care to see in my final output. But this cache equals true — this is both a very helpful thing and a somewhat annoying thing. The cache equals true option is, again, polite: if you are making incremental changes to your R Markdown, and at the top of your R Markdown you go and get data from a server, caching that data means you don't have to hit that server every time you just fix a typo and re-render your document, or add a new visualization and re-render. So that cache is very useful. However, it can be somewhat irksome, because if you've ever used a cache before, you know that caches can be sticky. If you're re-rendering your R Markdown and you had, let's say, a ggplot — this is usually what throws me off — you wonder: how am I getting the same ggplot as I did before? I changed three things; why are my changes not showing up? That is the cache being a little bit sticky. And what you'll see is a set of files — in fact, let me just go ahead and knit this so you can see what it looks like. As soon as I hit knit, I created this folder here that says opening map files underscore cache. And if the cache is causing you problems, all you have to do is delete that cache folder.
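A minimal sketch of that setup chunk, mirroring the options just described:

```r
# A typical R Markdown setup chunk, as described above:
knitr::opts_chunk$set(
  echo = TRUE,      # show the code in the rendered output
  message = FALSE,  # hide informational messages once the document is stable
  cache = TRUE      # cache chunk results so re-knitting doesn't re-hit servers
)
```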
And that is like refreshing your browser or clearing your cache in your web browser — a similar concept. But I do think it's important to point out cache equals true, because you might be using an API that actually requires you to pay. In that case, you want to make very sure you're not asking the same thing 10 times in a row every time you re-render that document. So we are going to leave cache equals true on, unless it gives us a headache later on. In terms of reproducibility: if anything ever goes wonky — and this goes for all of these scripts — please tweet at me or email me or reach out, and I'll help you figure out what's going on with the script. When rgdal is sunset, I'll probably change this to use sf. I'm going to talk a little bit about the structure: I expect there to be a directory called data at the same level as the scripts directory. And I've got some good information here — or, I should say, Posit (RStudio) has some great information — about making sure that you're in the workspace you expect to be in, either by using projects or by using working directories appropriately. All right, so up to line 38, that's all background information for you. Now let's start with shapefiles. Shapefiles, again, come from Esri, and they are actually groups of files: a shapefile is a compressed file that contains multiple files within it. You're usually going to get these from large government agencies, or maybe large NGOs or nonprofits that can afford Esri software — Esri software is very expensive. Your local neighborhood public health agency is probably not going to have shapefiles; they're probably going to use something like GeoJSON. And again, if you like reading standards, I have the standard there.
So the first thing we're going to do is download a file from the US Census Bureau — a shapefile — and unzip it. I tell you a little bit here about how I found this file. I wanted to find out what the census tracts are for my county, the county of Philadelphia, Pennsylvania, so I needed to find the boundaries for the tracts here. And I actually first wanted to get the shapefile for the census tracts in the whole state. So I clicked around in the web interface of this URL — let's go here — and this, again, was just me googling "how do I get census tract maps for my state." There are a couple of download options: a web interface, which is point and click, and an FTP archive, which allows you to download directly. So I first looked around in the web interface and said, well, I really want the decennial census tracts — let's say the 2020 census. 2020 census tracts, that's what I'm interested in, and I want Pennsylvania, and then I download it. What it downloads is this tl_2020_42_tract.zip, and what I can learn from that is the naming convention. I point all of this out to you because you will have to learn the naming conventions of files when you're looking around in public data and trying to figure out how to make this more automated. What you don't want to have to do is go through and click here, click here, pick the state, click here, download it, then move it over to your working directory — you want to get that link. And to understand what the file is called and where to look for it, you have to do it the human, point-and-click way first. All right, so that was using the web interface.
So then what I did is, I went into the FTP archive, and I was like, oh, this is interesting — let's look around here. What do we have? All right, let's look at TRACT. And then I've got all of these tl_2021 files: tl_2021_01_tract.zip and so on. Let's just take 42 — I know this is 2021 rather than 2020, but let's say that's fine. That's going to download it, so what I can do there is copy that link address, and that gives me my endpoint — the pattern for how I can download this. So I did a little personal sleuthing to figure out what the heck this would be called, and that's how I got this link. So I'm going to download that zip file. And where is it going to go? I'm going to unzip it in my data directory — that's just what I chose; you can choose something different in your project. I'm going to give it the same name, tl_2021_42_tract.zip. So let me run all of these blocks to this point, and now I'm going to download that zip file. All right, it is downloaded. Perfect. So now, what's inside? If we look inside data, there's my zip file, and then I did this unzip and extracted it to this folder. So now, within this folder, I have some stuff. Again, that was all me doing point and click, which is super useful and easy for me to do as a human — but let's do it in an automated way. Ideally, I would like to list the files within my R Markdown, so I'm going to do list.files. Let's run that. And you can see that within this shapefile, we've got seven different files: some XML files right here, an .shp — that's the main shapefile — and a bunch of other stuff. So, to have a valid shapefile, you have to have three files at minimum. One is the .shp.
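In code, those steps look something like this (the URL follows the Census TIGER/Line naming pattern we just worked out by clicking around; the local paths are my choices, not requirements):

```r
# Download the Pennsylvania (FIPS 42) tract shapefile from the Census FTP
# archive and unzip it into the data/ directory.
url  <- "https://www2.census.gov/geo/tiger/TIGER2021/TRACT/tl_2021_42_tract.zip"
dest <- "data/tl_2021_42_tract.zip"

download.file(url, destfile = dest, mode = "wb")  # wb: binary mode, important on Windows
unzip(dest, exdir = "data/tl_2021_42_tract")

# List what's inside -- the .shp, .shx, .dbf, and friends
list.files("data/tl_2021_42_tract")
```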
That's the main file that describes the geometry. If you think about any sort of polygon or shape — whether it's a county or a state or a country or a census tract — it will be a closed polygon, each vertex of which is a latitude and longitude point. Connect them all up, and that's when you get the shape. That's what the shapefile, the .shp, does. Then there's the .shx, which is an index file, and this .dbf, which holds the attribute data. All right, so those are the three required files. In our case, we have a few other things: some XML, a code page, and a projection file. So I'm going to pull that data in, and all I have to do is point to the directory, because the directory holds all of those files. I'm not pointing to a specific single file, but to the directory that holds them all, because those files together form one geographic reality. I'm going to use the same readOGR from rgdal, and I'm going to call the result pa, for Pennsylvania. All right, perfect. So there's pa: it appears in my global environment as a SpatialPolygonsDataFrame, and we can look inside. Similar to what we did before the break, we've got some information in the form of a data frame — and importantly, this data frame is a data frame like any other, so I can use all of the same dplyr operations on it. We have a bounding box, which shows the maximum and minimum latitude and longitude: the north, south, east, and west extremes — the box that frames this area of interest. We have some information about the projection, and a list of polygons — there are about 3,400 census tracts in the state of Pennsylvania. All right, so let's look at the data frame. What's in it? We've got the state — Pennsylvania is 42. We've got the county: here's one county, county 077; here's county 071.
We've got some tract information; the geographic ID, which, if you look, is the concatenation of the state, county, and tract — the GEOID; the name of the census tract, 62.03; and some more information. I'm not sure what MTFCC is, to be quite honest. Then: what is the area of land, and what is the area of water, in this tract? So that is the data frame — let me expand pa; that is the data frame in the at-data slot. All right, a quick aside about FIPS, just because, if you're requesting data, or people are requesting data of you, they will say "give me the FIPS." Well, there is no "the FIPS"; FIPS is a standard. It's like saying "give me the metric" — metric is a system, and FIPS is a system. We sort of talked about this, but a geographic identifier is the concatenation of information from the most gross to the most fine: state, county, tract, block group, block. So, depending on how much information you want or have, your GEOID may be a different number of digits — maybe you go down to the block level, maybe you don't. But if you've ever wondered how FIPS identifiers are created, there's information about that. All right. We're going to quickly draw a map of these census tracts. This is similar to what we've done before. I do add this suspendScroll, because Leaflet is a JavaScript library, and if you're used to scrolling through information in your web browser, it may happen that you think, oh no, I just wanted to scroll, but instead I zoomed, or instead I panned, in this map. The same thing can happen here in RStudio. The suspendScroll is supposed to make that less likely; your mileage may vary. So this is the Pennsylvania census tract map. It's not that pretty, so I'm going to restyle it here. This is the same information we talked about before, adding some arguments you can learn from: opacity and color, etc., for the lines, and fillColor and fillOpacity for the fill.
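To make the FIPS concatenation concrete (42 is Pennsylvania and 101 is Philadelphia County; the tract value is illustrative — it's tract 62.03 written the TIGER way, without the dot):

```r
# A GEOID is just the pieces glued together, coarsest first:
state  <- "42"      # Pennsylvania
county <- "101"     # a county code (three digits)
tract  <- "006203"  # a tract code (six digits; "62.03" without the dot)

geoid <- paste0(state, county, tract)
geoid  # an 11-digit tract-level GEOID: "42101006203"
```

Going down to block group or block just appends more digits to the same string.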
And then there are these great label options. So for the label, what I'm going to do is look inside the data frame and say: give me NAMELSAD, which is the spelled-out English name of the tract, "Census Tract so-and-so." Let's paste that to the text "GEOID:", and then paste that to the actual GEOID value. And then this tells me some things about how I want that label to appear. So let's take a look at this. And... nothing happens... there it is, okay. I'm going to remap, because again, it's 3,400 polygons. But as you can see, if I move my mouse over, the pop-up label changes. It tells me what the census tract identifier is, and the GEOID. All right. So again, this is useful if you wanted to say: in this census tract, this is the poverty level, and this is the tuberculosis or asthma prevalence, or something like that, right? To have something where you can hover and look, and leaflet makes it very easy for you to do that using label. So let me pause there and ask if folks have questions to this point. All right. So, I talked earlier about the fact that there are three required files in the shapefile, but what are these extra files? The next few lines of code look at those. This is the .prj, or projection. If you're a map nerd and you like different projections, this will tell you the projection that the map is using. The character encoding, which is UTF-8, which is sort of the standard character encoding. And then there's some metadata as well in these XML files about, you know, who's responsible for this, who maintains it, what's the dataset URI, when it was created, etc., etc. There are two files that hold information like that. And that's a shapefile. So all of that was breaking down what's inside a single shapefile, right? You download this zip.
And it contains a bunch of files on the inside that together give you the geographic data. That's the Esri shapefile standard. Then there's GeoJSON. GeoJSON is essentially a JSON file, if you're familiar with JSON, that also includes a list of points that says: by the way, the geometry for this row is the following, from this lat/long to this lat/long to this lat/long, etc. And it could be multiple lists of multiple points, saying these are the edges of this polygon, or it could be a single lat/long point that says this is the point. Right? So it just includes geographic data as one more element of the JSON. This one is the New York Senate districts, so the state Senate in the state of New York in the US. The URL here is from the New York open data portal, and you can see that I'm exporting it as GeoJSON. And what I'm going to do is pull it in just as JSON. So I want to look at it just as JSON, without really acknowledging that it's GeoJSON, just so we can look inside it and see some of the JSON features. So we can see that there are these properties, like the shape area, the shape length, which Senate district it is. And then there's the geometry. And you can see that the geometry is all of these lists, and lists of lists, of points that are connected one to another. Lots and lots and lots of lat/long points. All right. Again, let's just look at the properties: shape area, shape length, and state Senate district. And we can look at the geometry. Now again, the geometry is huge. It is many thousands of points, because you can imagine a Senate district, especially if it's been heavily gerrymandered, could have thousands of tiny little line segments between lat/long points that create that polygon.
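Pulled in as plain JSON, a GeoJSON file is just nested lists in R. Here's a toy, hand-built miniature of that structure so you can see where the properties and the lists-of-lists of coordinate pairs live; the property name and every value here are made up for illustration (GeoJSON stores coordinates as [longitude, latitude] pairs):

```r
# A toy GeoJSON-like structure as nested R lists
gj <- list(
  type = "FeatureCollection",
  features = list(
    list(
      type = "Feature",
      properties = list(st_sen_dist = "28"),  # hypothetical district id
      geometry = list(
        type = "Polygon",
        # one ring: a closed square of lng/lat pairs
        coordinates = list(list(
          c(-74.0, 40.7), c(-73.9, 40.7),
          c(-73.9, 40.8), c(-74.0, 40.8),
          c(-74.0, 40.7)  # first point repeated to close the ring
        ))
      )
    )
  )
)

# Drill in the same way you would with the real data:
gj$features[[1]]$properties$st_sen_dist        # "28"
first_ring <- gj$features[[1]]$geometry$coordinates[[1]]
length(first_ring)                             # 5 vertices (closed ring)
first_ring[[1]]                                # lng, lat of the first point
```

A real district has thousands of these vertex pairs per ring, and possibly multiple rings, but the shape of the structure is the same.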
So instead of showing all of them, or, you know, very many of them, I'm going into some of the structure here to show you just one small part of it. All right. So these are a bunch of latitudes and longitudes, and each of these pairs forms a point. All right. So that's just looking at it as vanilla JSON. If we pull it in as GeoJSON, using the same readOGR that we've been using throughout... the nice thing about readOGR is that it'll take a standard map format, whether it's a shapefile or GeoJSON, and convert it to look the same. It's an abstraction library, right? It takes various formats, abstracts away the differences, and gives you a common data structure. Okay. So let's look at this one. This is a SpatialPolygonsDataFrame again: we've got that bounding box, we've got the data, we've got a list of 28 polygons. And then we can again map it using leaflet. We're just saying: hey, this is the data we're using, the New York Senate districts; this is the view that I want, what is the center and what is the zoom level. And then please give me polygons, and I would like them to have lines that look like this, I would like them to be filled in a way that looks like this, and I would like them labeled with, in this case, just the state Senate district identifier, styled with this kind of CSS for how it will look on the web. So let's hit go. And here... when I first did this, I thought, why is it showing Scandinavia? This is not... no, this is Manhattan, I believe that's right. So these are the state Senate districts within New York City. Right. Oh, but unfortunately I don't know why it's not showing the labels. I'll have to double check that. Did I get the... oh, I got the label wrong. It should be st_sen_dist. All right, so let's try that again. Oh, there we go.
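The hover labels in both of these maps are just string concatenation done before leaflet ever sees them. Here's a minimal base-R sketch, assuming a data frame with the NAMELSAD and GEOID columns you get from a Census TIGER shapefile (the rows here are fabricated); the resulting character vector is what you would then pass to leaflet's addPolygons(label = ...):

```r
# Hypothetical rows mimicking PA@data from the TIGER shapefile
tracts <- data.frame(
  NAMELSAD = c("Census Tract 62.03", "Census Tract 1.01"),
  GEOID    = c("42101006203", "42101000101"),
  stringsAsFactors = FALSE
)

# Build "Census Tract 62.03, GEOID: 42101006203"-style hover labels
labels <- paste0(tracts$NAMELSAD, ", GEOID: ", tracts$GEOID)
labels[1]
```

Because paste0() is vectorized, one call builds all 3,400 labels at once, one per polygon, in the same row order as the spatial data frame.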
All right, so 28, 59, 27, 59, 18, 15, 14, 16, etc. All right. "Not a question," says Daniel, "but I believe you can also use the tigris package to pull census tract shapefiles," and he gives an example. Thank you so much for that. That's a lot easier than trying to find it on the web like I showed you, so that's helpful. Next steps. So, you know, chances are that you have your own data: you have collected data about your patients or your research subjects, and you have it linked to their census tract or to their county, right? And the nice thing about these data frames that live within these geographic structures is that they're data frames, right? So you can imagine very easily adding another column and pulling in data from another data source using merge, or using a join. Regardless, you just want to make sure that the map that you're using and the tabular data that you're using have the same key information. So if it's, you know, capital-C "Census" capital-T "Tract" and then a number with a decimal point, you just want to make sure your formatting is the same, or use a GEOID, so that when you merge those two data frames you find something useful. Here I also speak a little bit to a question we had earlier, which is: what is the place of leaflet in the zeitgeist of mapping in R? And the answer is there are tons of different ways to map in R. But I like leaflet because you can use it in a number of different places; it's highly portable. In Python it's called folium; you can do some searching about folium there. And so that's it for opening map files: we looked inside a shapefile and then we mapped it using leaflet; we looked inside a GeoJSON and we mapped it using leaflet. Any questions or comments about that before we take a break? And then we'll jump into census data. All right. I've been talking a lot. "Would it be possible to add a heat map in leaflet?" Yes.
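The merge step described above works like any data-frame merge, provided the GEOID (or whatever key you use) is formatted identically on both sides. A base-R sketch with fabricated values; with a real SpatialPolygonsDataFrame you would merge into its @data slot, typically via sp::merge so the polygon order is preserved:

```r
# Attributes as they might come out of a map file
map_attrs <- data.frame(
  GEOID = c("42101000100", "42101000200", "42101000300"),
  NAME  = c("Tract 1", "Tract 2", "Tract 3"),
  stringsAsFactors = FALSE
)

# Your own tabular data, keyed on the same GEOID (values fabricated)
my_data <- data.frame(
  GEOID  = c("42101000200", "42101000100"),
  asthma = c(0.12, 0.08),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every polygon, even tracts with no study data (NA)
combined <- merge(map_attrs, my_data, by = "GEOID", all.x = TRUE)
combined$asthma[combined$GEOID == "42101000100"]  # 0.08
```

The tract with no match gets NA, which is usually what you want for a choropleth: the polygon still draws, just with no fill value.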
And so one of the things we'll do in the next section is create what's called a choropleth. That is sort of a traditional heat map, where each polygon gets a shade of a color that is more or less intense based on the data belonging to that particular polygon. Oh, this is a really good question. Jonathan asks how you would take a list of patient addresses for ED visits and then merge with census tracts for mapping. So there are a couple of issues here. One is: how do you get an address to be a lat/long? That is called geocoding. Now, if I were to geocode my own address, it's very easy to do. In fact, the US Census has a nice geocoder; search for "census geocoder." And I can put in CHOP's address. I'll do this "find locations by one-line address." So I work at the Children's Hospital of Philadelphia, so I'm going to say 3400 Civic Center Boulevard, Philadelphia, PA, 19104. I think that's right. So this tells me the latitude and longitude, and that's what you would need for a map. However, when you are talking about patient addresses, that is PHI. You do not want to go into a Google geocoder or a census geocoder and geocode those addresses, right, because you are sending that address out into the public internet, and that is not a safe way to handle data. So it really depends a lot on context. If this were in the context of a hospital, many hospitals and universities have access to on-site geocoding databases. Instead of going out to the web, they geocode in their own data center. So the Children's Hospital of Philadelphia, in its data center, can geocode addresses, and in our clinical data warehouse we have the lat/long for every patient address at every point in their life. Right? So we can tell what the lat/long was when they lived, you know, at their 2008 and 2015 and 2019 and 2021 addresses.
So I would say first, talk to whoever is in charge of your medical data, and just say: hey, if it's not in the EHR, is there a data warehouse or something from which I can pull that geocoded data, that lat/long? That's the first question: how do you get to a lat/long. Now let's say you already have the lat/long; you're like, "Joy, the address-to-lat/long step was already done; that's not what I'm asking about." Well, then what you have to do is map which points belong to which polygons, right? You will have a polygon that encloses a certain amount of space. So if you have some addresses or locations, some lat/long points, some of them fall within that shape and some of them fall outside it. Different libraries have different ways to do this; the sp package has a function called over(). And it basically says: will you please throw these points against these polygons and give me back which polygon each of these points belongs in. So that I could map, let's say, incidents of crime, which have a lat/long, to census tracts, and then make a map of census tracts: where is the highest crime rate? Really good question about how you convert zip codes to the appropriate census tracts over the years, given that zip codes change and so do census tracts. We will talk about this more in the next section, so I don't want to give that away just now, but we will get there. So let's take a pause here. It's almost five after the hour. Let's take a full, let's say, ten minutes and come back at fifteen past the hour. I'll be here. I'm going to mute myself, but if you have questions, feel free to pop them in chat. All right, welcome back. Lots of great questions, and I've done a little bit of looking around to find some resources to answer some of these specific questions. I answered these in chat, so for those of you who thought, "but I went to get coffee, why did you do something while I was gone?": I didn't speak.
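As an aside on that over() question: what over() answers for each point is a point-in-polygon test. Here's a toy base-R ray-casting version just to show the idea behind it (this is purely illustrative; in real work you would use sp's over() or sf's st_join, which handle projections, edge cases, and thousands of polygons properly):

```r
# Toy ray-casting point-in-polygon test.
# (px, py) is the point; vx, vy are the polygon's vertex coordinates.
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does a ray from (px, py) going right cross edge (j -> i)?
    crosses <- (vy[i] > py) != (vy[j] > py)
    if (crosses) {
      x_at_py <- vx[i] + (py - vy[i]) / (vy[j] - vy[i]) * (vx[j] - vx[i])
      if (px < x_at_py) inside <- !inside  # odd number of crossings = inside
    }
    j <- i
  }
  inside
}

# Unit square standing in for a "census tract"
sq_x <- c(0, 1, 1, 0); sq_y <- c(0, 0, 1, 1)
point_in_polygon(0.5, 0.5, sq_x, sq_y)  # TRUE  (inside)
point_in_polygon(2.0, 0.5, sq_x, sq_y)  # FALSE (outside)
```

over() effectively runs this kind of test for every point against every candidate polygon and hands you back the matching polygon's attributes, which is how crime points end up binned into census tracts.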
So you didn't miss anything; it's all in chat. I will go from the most recent question, the bottom question in chat, and move my way upward. One question was: how can I tell if my data is polygon or point? The easiest way, once you ingest that data, is to use str(). Let me show you what I mean by that. I believe that is in the "simple maps from Socrata" file. If you look in that file, around lines 46 to 48, you'll see that you need to know whether you have points or polygons in your data. And how can you tell? Let's say it's something like crime data, and you're not sure if it's counts for each shape, the counts for this county and the counts for that county, which would be polygon data, or if it's points: there was a shooting here and a robbery there, a burglary here and a car theft there, each at a point. If you're not exactly sure, maybe the easiest way is to do an str(), or structure, call on your data once you've ingested it, with max.level = 2 (I said max-underscore earlier, but I believe it's max-dot: max.level). What that means is it's only going to go down two levels of hierarchy, so it's not going to take you down into, you know, the list item of the list in the column of the data frame; it won't take you down into all those details. Oh, you don't see my screen? Thank you, I'm so sorry. Let me go into Zoom. What's happening, Zoom? Where are you? Hang on, technical difficulties. Show all windows... "no available windows," that can't be the case. What's strange is I see my chat window, and that's the only Zoom window I see. I apologize, folks; let me close all my windows to see if I can get my Zoom window to come back. Aha. All right, let me share my screen.
Thanks for calling that out, and thanks for bearing with me through technical difficulties. Don't worry, I'm an IT professional. All right, share screen, and I will share my RStudio Cloud Chrome window. Now you should see my screen. So the "simple maps from Socrata" .Rmd, lines 46 through 50, kind of shows you how to tell what kind of data you have. All right, so that was one question. Another question, which was a really good one, was: what do I do if I want to show data from a couple of different geographic areas? Let's say I have a couple of counties in one state and a couple of counties in another, and I want to show just that metropolitan area, just that zone. How do I do that? It's not going to come ready-made for me that way; I'm going to have to say, I want this part, but chip off all these pieces I don't want, etc. So I did find the file where I did this work, and I call it "Crafting Just the Map You Need." The history of that is, I needed to work in the Philadelphia area, and Philadelphia is right on the border with New Jersey and very, very close to Delaware. And I was doing a project where I needed the, in this case, seven-county area that was the catchment, not for my hospital but for a side project I was working on for an animal shelter. So I wrote up what I found, and I give you my steps here. These are maybe just two states, New Jersey and Pennsylvania. I say "shapefile" here, but I imagine it will work with GeoJSON as well. So I get the data from one state, and then I just filter on just the counties I want.
And then I get the shapefile of the next state and filter on just the counties I want, then I combine those files and then map it. So I won't go through this, but it's on my RPubs. This is from probably 2019... yes, 2019, so your mileage may vary on some of it. If you try this and it doesn't work, reach out to me, and I will spruce it up and make sure it works so that you can get what you need. But this walks through that use case. All right, so. Perfect. And then I went up, and, oh, Federica had a really great question about how to use these endpoints. So let's say you have an endpoint like this, where you looked in Socrata's catalog and you found this great endpoint. What do you do? How do you use it? What you can do is scroll down, and there's this sort of pinkish-grayish box in the middle, under "Getting Started." And I would say really stick to the Getting Started area. This is your endpoint, and you can see here it says .json. In the upper right-hand corner there's a choice of what file type you want. Do you want CSV? Well, now the endpoint changes, right? Or do you want JSON? And it changes again. So that is your endpoint; that's the thing you're going to want to work with. Oh, and helpfully, it shows you what the data looks like. Look at that; that's pretty helpful. And then there's often, you know, some information here that might be useful, documentation about how to use this, etc. All right, then there was another question about using over(), and I said I had a resource that I would share when we came back. So this is an example, and it's from a CHOP website. This again is a couple of years old, from 2018, so again, your mileage may vary; some of this code may not work as well.
But this is something that I wrote to talk about environmental exposure. So think about exposures: something like children with autism; some children with autism are drawn to water, so water could be a danger for them. Or, you know, in Philadelphia we have shuttered oil refineries. How close are certain patients or certain neighborhoods to those shuttered oil refineries? That could be an exposure, right? We have areas with a lot of lead in our soil, because we had shipyards. And so this is an example of how to work with that. There's a bunch of stuff I do in this example, but apropos of the question about combining points and polygons: in this example I take the census tracts from the city of Philadelphia, and these are the 2010 census tracts, to be clear, and those are polygon or shape data. And then I have points that I want to map onto those. So here I go; let me jump to mapping points. All right, this is leaflet. And this is one of the nice things about leaflet: you can see that there is not a lot of code here. I have this call to addMarkers, and I pass clusterOptions, and what that does is cluster nearby points together. This is just leaflet being leaflet and doing its thing on its own, which I think is very cool. These, unfortunately, are real data; sadly, there were this many shootings in 2018, and in 2022 it's many more. You can drill down and get an idea, and this is straight out of leaflet. Not sure why these little points are not resolving; they show "image not available." All right, so those are points, but what I want to do is map them to census tracts. Here, it's sort of like binning, but using the existing bins of census tracts. So what I do is grab a map of just the census tracts.
That's the city of Philadelphia; we have 382 census tracts in the city of Philadelphia, the city slash county. And what we could do is just visually overlay them, and this is just using ggplot. So this is visually overlaying points on polygons. But from a data perspective, maybe what I'd like to do is a heat map, or a choropleth. So what I want to say is: well, in the upper left-hand corner of Philadelphia there are no shootings, but right in the center of Philadelphia there are many, many shootings in a given census tract. How would I do that? So then I explain how to use over(). What I would like to do is take the coordinates of these shooting events and put them against the Philadelphia census tracts, and what I get back is data that tells me which census tract each one lands in. Again, I'm not going to go into this; it is there for you to look at, and again, if it does not work, I want to know about it, because I want this workshop to be useful for you. So please don't be shy about reaching out to me by email and saying, hey, I tried this, it didn't work, it's a couple of years old, do you mind refreshing it? I'm happy to do that work. And I'll just take you right to the big reveal at the end here. All right. Oh, there's a little bit here about exercise in average minutes per day; that is fake data. I intended to show how you would combine data, so I fabricated it; ignore the exercise part. But this is an actual choropleth, or heat map, that shows the shootings in Philadelphia from 2015 to 2018, where I have mapped the points to each polygon, and each polygon is colored with an intensity along this gradient. So hopefully that is helpful to you. That answers most of the questions. Let me jump back in; I have many, many tabs. Here we go. Excellent. So let's jump into census data.
Now, if you are not interested in US census data, or you are not in the United States, I apologize in advance. This is US-centric, and I am aware of that privileged standpoint. If you're most interested in data from another country, there are probably offices, you know, demography offices, with similar information that you may want to look into. So, US Census Bureau trivia: in Article I, Section 2 of the US Constitution, the Census Bureau (it's not called the US Census Bureau at that point, of course) is tasked with conducting a decennial, or every-ten-years, census for the apportionment of seats. If you're a math nerd and you love really complicated math, or you're interested in privacy from a mathematical or legal perspective, the decennial census is really interesting; they've done a lot of work on something called differential privacy. I won't go into it here, but people get their PhDs in this, and it's very interesting. So, the decennial census is really only required by law, by the Constitution, to produce overall state numbers: how many people are in each state. And this determines the number of seats in the US House of Representatives, because we have a bicameral legislature in this country, and the lower house is apportioned according to the population of each state. The US Census Bureau does a lot more than the decennial census, though. It also runs the American Community Survey, or ACS, which collects a lot of information about different neighborhoods. The number of variables they ask about is really incredible; there's lots of really detailed information. And this is done by statistical sampling, right? This is not an enumeration; it is not "this is the number of people and we've counted them one by one." This uses inferential statistics.
So this is really going to be interesting for you for questions about the social determinants of health. Things like: how many bathrooms are in various housing units, and are there housing units without bathrooms? How far do people commute, and what do they use to commute? What is the percentage of the population in this census tract that is below the federal poverty line? Etc. It's really very, very rich, the kind of data that comes out of the ACS, or American Community Survey. There used to be three versions of the ACS: one-year, three-year, and five-year sampling windows. Now there are really only the one-year and five-year versions. Five-year ACS data is collected across the entire country. Major metropolitan areas get one-year samples as well, but the sample size is smaller and less robust. So from a statistical robustness perspective, as well as a coverage perspective, I suggest the five-year version, and that's what we're going to use in the workshop today. There's tons of other stuff that the Census Bureau does; it's really incredible the amount of data that is collected. We will barely scratch the surface of it, but feel free to go down that rabbit hole as you see fit. So, census data is highly, highly geographical. There are various levels of aggregation. There's the whole country: how many people are there in the United States? States and territories: what is the average family income in Montana? Counties. Zip code tabulation areas, which relates to a question from before our break that I said I would come back to: how do you correlate or connect census tracts to zip codes, when zip codes change and census tracts change too? Zip code tabulation areas are the Census Bureau's way of saying: this is pretty much the zip code, but the post office does what it does for its reasons, and we do what we do for ours.
And nobody can guarantee that every address and every location lands precisely right, so ZCTAs, or zip code tabulation areas, are very, very close approximations of zip codes. Then there are census tracts, which are aimed at roughly 1,200 to 8,000 people, so a census tract in Montana will be very geographically large, whereas a census tract in the city of Philadelphia might be two city blocks, right? Census block groups, which are clusters of census blocks, are intended to hold between 600 and 3,000 people, and there are other levels of geography as well. As you can imagine, the smaller the geography, the higher the risk of re-identification of individuals based on statistical or enumerated characteristics, and that's part of what differential privacy is all about: doing some smoothing of the data and adding some noise or jitter. That is great for privacy, less great for precision work. We're not going to get into that today, but I just want to flag it. I also want to flag that census tracts do change on the ten-year, decennial cycle, right, because the Census Bureau realizes, oh, this one needs to be split, or we need to move the boundary over here because there are more people in this neighborhood. So these do change. That being said, there are typically many tracts, or at least multiple tracts, within a zip code. A zip code is a larger entity than a census tract, so there is not a one-to-one mapping. So when working with zip codes, ZCTAs are the best idea. Now, you may be asking about that because maybe you have something called a limited data set, which is a term of art in privacy, HIPAA, and the federal Common Rule, describing certain kinds of data sets. In a limited data set you might be able to have zip-code-level data, depending on the population size of the zip code, or the first three digits of the zip code, or things like that.
A few years ago there was some talk in the Office for Civil Rights about allowing census tracts to be included in a limited data set, because census tract information is so broadly used in describing population characteristics. I don't know where that stands now, so I would say, again: census tracts are small populations, and the risk of re-identification goes up. So please behave ethically. Talk with your institutional review board. You know, make good choices around your use of geographic data. Let's see. The website of the Census Bureau is fantastic. There's also a human-facing data site at data.census.gov. This is definitely human-optimized, not machine-optimized, but you can definitely waste a day there just looking at different data sets and seeing all the incredibly rich detail. I'm actually not going to have us do this exercise, just in the interest of time; I will leave it as an exercise for the reader. What kind of data or information would be useful for you in your clinical practice, your advocacy work, your public policy work, or your research? Lots of potential there; please do check it out on your own time. The Census Bureau has an API that is useful and great and has really good documentation, and you are about to go get an API key, and we're going to use it. This is what the Census Bureau says about its own API key. If you're going to use census data more than once in your life, I would say get an API key. It is free. It is easy. And I won't read that quote to you, because that's annoying. There is essentially a throttle: you can get some things with no API key, just anonymously, but if you're going to be getting a lot of data, please get an API key. It comes at no charge, but it's a way for them to say: oh, that's interesting, this entity is asking for this data, and they're asking for enough of it that we kind of want to know who they are.
That way, let's say you created an application and it's just hammering the server, asking for the same data over and over again. They have your email address, and they can say: hey, this looks problematic, what's going on? Can we help you figure out how to use our API more effectively? Because we have to pay for all of this outgoing data. Happy to do it, but it seems like something might be up with your application; can we help? And they themselves point out that once you have an API key, you can use R, Python, etc., to grab that information via an API endpoint, which is just a URL. So what we're going to do right now is have you go to api.census.gov/data/key_signup and get your API key. Let me find that census API key link and drop it into chat. Please do that. In my experience, it takes less than a minute for it to come to your email, but I want you to do it now, so that in a couple of minutes, when we're ready to use it, it will have arrived. So I'm going to pause and stop talking. While you do that, I'm going to go back over the chat. Oh, great question, Beth: are we putting our name here for organization name? That's what I do. I just put my own name, unless I know I'm doing it for work. I'm just dropping some answers into chat, so if you asked a question in chat, check there. Oh, thank you so much, Raymond, for mentioning PLACES. All right. So hopefully everybody has had a chance to sign up for the API key. Let me go back here. All right, so there are a bunch of different API endpoints. I will, again, leave that as an exercise for the reader: this is a URL, and yes, it's a URL you can click on. So if you're looking at this presentation on RPubs, you can just click on it. And tidycensus is a package that helps you work with the APIs from the Census Bureau.
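For a sense of what tidycensus is abstracting away, a Census API request is just a URL with the variables, geography, and key as query parameters. Here's a base-R sketch of assembling one; the dataset path (acs/acs5) and variable code (B01003_001E, total population) follow the published API documentation, but double-check them for your vintage, the in= syntax shown here is one of a couple of accepted forms, and "YOUR_KEY" is a placeholder:

```r
# Assemble an ACS 5-year API URL for all tracts in one county.
build_acs_url <- function(year, variables, state, county, key) {
  paste0(
    "https://api.census.gov/data/", year, "/acs/acs5",
    "?get=NAME,", paste(variables, collapse = ","),
    "&for=tract:*",
    "&in=state:", state, "%20county:", county,
    "&key=", key
  )
}

url <- build_acs_url(2019, "B01003_001E", "42", "101", "YOUR_KEY")
url
# You could then fetch the JSON response with, e.g., jsonlite::fromJSON(url)
```

Seeing the raw URL makes it clear why a maintained package is nice: the dataset paths, variable codes, and geography clauses all have to be exactly right, and tidycensus handles that for you.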
So you'll see that sometimes I've used packages and sometimes I haven't. Like with Socrata (how do you say it? Like Hakuna Matata: So-crah-ta), in the Socrata file I do everything manually; I don't use the R package that is intended to abstract away the details of that API. And sometimes I use a package. So when do you do which? In my experience, different packages are updated at different cadences, and some packages I just like better. I do a lot of explaining to people, for example, how to use the REDCap API, and there are some great R packages that work with the REDCap API. However, I think it's always useful to know how to do it the manual way, in case REDCap makes an update that the makers of the R package haven't had a chance to catch up to, right? In most cases it's six of one, half a dozen of the other, whether you do it the manual way or use a library. In this case, I've been very happy with tidycensus. The census API is particularly dense; they have a lot of information, and it can be a little difficult to work with manually. And the Census Bureau has a lot of customers, so this R package is updated all the time, and the maintainers are really hewing to the Census Bureau's API. Compare that with a package that has far fewer customers. I still like and use the PubMed package, for example, but that could be just a couple of people in a lab keeping it updated as a hobby, with no promises that it's 100% up to date. But I like tidycensus. There's some great documentation that we won't go into, but the American Community Survey API handbook has lots of information about how the URLs are structured. Again, you don't have to worry about this, because we're going to use tidycensus.
But this is a two-chapter handbook that is worth reading and looking at. And if you, like me, normally can't stand reading government standards and documentation, I have to say this is a chef's-kiss amazing example of really good documentation. The Census Bureau really has their stuff together here. So, we talked about FIPS already. Again, tracts and blocks can and do absolutely change from decennial census to decennial census, every 10 years. Here I give the example of the Children's Hospital of Philadelphia and our geographic ID, which is a concatenation of our state code, 42 for Pennsylvania, our county code, 101, and then the census tract and block. It's important because if you are getting data from, say, your clinical data warehouse, they may only have the census tract. Let's say the census tract is 107. Do you have any idea how many 107s there are in the United States? 107 — but in which county, and in which state? So when you request data from an internal organization, make sure you're getting enough information to really locate it. The tract alone is not enough; you also want the county and the state or territory. A little aside about the granularity of data: the American Community Survey is extremely specific. Let's say, for example, you want to do some disparity work, or you want to look at income and asthma, or access to care, etc. So you're really looking for income. Well, there's income and benefits for total households. Then there are these different tiers. Then there are families — so, households and families; what's the difference there? And then there's income and benefits with SSI. And there are a bunch of other granularities.
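Since the geographic ID is just a fixed-width concatenation, it can be pulled apart mechanically. A tiny base-R sketch (the block-level GEOID below is made up for illustration, not CHOP's actual block):

```r
# A 15-character block-level GEOID concatenates:
# 2-digit state + 3-digit county + 6-digit tract + 4-digit block.
parse_geoid <- function(geoid) {
  list(
    state  = substr(geoid, 1, 2),
    county = substr(geoid, 3, 5),
    tract  = substr(geoid, 6, 11),
    block  = substr(geoid, 12, 15)
  )
}

parts <- parse_geoid("421010369001000")
# parts$state is "42" (Pennsylvania); parts$county is "101" (Philadelphia).
# Tract labels like "0369.00" drop the implied decimal: tract "036900".
```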
I really have to get very specific about what data I care about. Do I want to include Social Security? Do I want to include welfare? Am I talking just about families, or also other types of households, which could be individuals, or a group home, or people who are not related to one another but live together — roommates whose combined income pays for the roof repair and things like that? You really have to think about what specifically you want, because the data is highly, highly granular. So again: what question are you asking, and who are you asking it about? About everyone? About children? About senior citizens? About white people? About men? Really get those questions narrowed down. There's another thing you need to consider, which is how the American Community Survey gives its data. There are usually four measures associated with any variable, and only two of them make sense for a given variable. You will either want estimate and margin of error, or you will want percent and percent margin of error. Estimate and margin of error are used when there is a scalar value, like median income, or number of children in a household. Percent and percent margin of error are used when there's a percentage, like percentage of households under the poverty line. So just know what you're looking for. Every area in the United States is covered by a census tract, whether or not people ordinarily live there. And importantly, in the decennial census, there are people counted in places where people are not expected to live.
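If you ever derive a percentage yourself from two estimates instead of using the pre-computed percent columns, the margins of error combine too. The ACS handbook gives a standard approximation for the MOE of a proportion; a base-R sketch with invented numbers:

```r
# Approximate MOE for a derived proportion p = num/den, following the
# ACS handbook formula: sqrt(moe_num^2 - p^2 * moe_den^2) / den.
# If the value under the square root goes negative, the handbook says to
# fall back to the ratio version, which uses + instead of -.
moe_prop <- function(num, den, moe_num, moe_den) {
  p <- num / den
  under <- moe_num^2 - p^2 * moe_den^2
  if (under < 0) under <- moe_num^2 + p^2 * moe_den^2  # ratio fallback
  sqrt(under) / den
}

p   <- 100 / 1000                   # say, 10% of households
moe <- moe_prop(100, 1000, 30, 50)  # roughly 0.0296 here
```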
So unhoused people may be counted in the census tract in which they reside, regardless of whether that census tract has lodgings or not. And every area, whether it's an airport or a lake, has a census tract. So given a sufficiently large geographic area, you will have areas that have no data. That is typical — there is no median income for a census tract that is in the middle of a lake, for example. This is to be expected. So let's jump into the practical work; I'll do this with you. Hopefully you created the private directory when you downloaded this data. I have my private key on my local machine, so I'm not actually going to do this from RStudio Cloud; I'm going to pull up my own RStudio and change the view of what you see. What you want to do is open the email your census API key was sent to — check your spam, too — and find it. It should be alphanumeric, probably 16 characters long, something like that. It's a really simple email, and I love it: the last time I did this, they signed off with "Have fun! — the US Census Bureau." So find that text. You're going to make a new text file and save just that text, those 16 characters or however long it is, in a file called census_api_key.txt. It needs to be called exactly that, because the script I provided you with expects that file. And it should be in the private directory. So what I'm going to do, if I can get Zoom to behave for me this time, is change what you're looking at so you can see my RStudio window. New share... and I want this share. All right. You can see that I have a number of folders here — scripts and images and data, which you all have. I've got some other files that you probably don't have, and that's fine, not a big deal.
And I also have this directory that I added, called private, and inside private I have this file that I will not open, because it is my secret API key. It is simply a text file, and inside it is my API key. If you don't yet have your API key, don't worry — you can catch up at a different point, but I want to make sure we get this in before we close for the day. So I'm going to go all the way back up, go into scripts, and open census_data.Rmd. A lot of the very top of the file is things I've already talked about — this R Markdown has all that information about granularity of data and the API and all that stuff. I'm going to go right past all that verbiage, the first 150 lines or so, but it's there for you, in case you're tired — it's been a few hours — and you want to read it later. All right, I'm getting a little bit of lag here, hang on. There we go. I'm starting around line 133, where it says "let's get census data." The first thing I'm going to do is get my API key, so let me just run all the cells up to this point using that button there. So now — oops — well, now you see my census key anyway. I didn't think that came through too clearly, but that's okay. You can see my census key there; I'll burn it and get myself a new one when this is over. You should never share your API keys, which is why I keep mine in a private folder that, in my case, is not included in GitHub. But then you have to think about screen share, as it turns out. All right, so I've got my key pulled in. Now what I can do is run the setup for tidycensus. And then — you know, I spoke a lot about how many variables there are, and how specific and granular some of these are.
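The pattern the script uses — key in a git-ignored text file, read at runtime — can be sketched like this. The real path in the workshop repo would be private/census_api_key.txt; the demo below writes a throwaway file so the sketch runs anywhere:

```r
# Read an API key from a one-line text file kept out of version control.
load_census_key <- function(path) {
  trimws(readLines(path, warn = FALSE)[1])
}

# Demo with a temporary file standing in for private/census_api_key.txt:
demo_path <- tempfile(fileext = ".txt")
writeLines("abc123fakekey456", demo_path)   # a fake key, for illustration
key <- load_census_key(demo_path)

# Then hand it to tidycensus (not run here):
# tidycensus::census_api_key(key)
```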
What I'd like to do is see what variables are available for the 2020 5-year American Community Survey — that's the ACS 5 that ends in 2020, so this is data from 2016 to 2020. So I want to pull those variables in, and then present them in a somewhat friendlier way using kable. That's just a presentation layer; it prettifies things. So it's pulling that in, and there are many, many, many of these. There's a scroll box within the document. I want you to look at the label, then the concept. And you can see there's geography as well — the block group level, etc. So you can count how many people there are, sex by age. Oops, let me scroll up a little bit. It may be easier for me to do this in RPubs if my RStudio is too slow. All right: people reporting single ancestry — an estimate, total, sub-Saharan African, Zimbabwean. Scroll down: place of birth by individual income in the past 12 months — total, native, born outside, etc., etc. You just keep scrolling and there are all these different things: means of transportation to work by time of departure to go to work. So you can see this gets absurdly specific. There are thousands of these. So how will you find the one you care about? One way is to look specifically at the concepts. So I'm going to take all of that data and pull out just the concepts. As a reminder, there were these labels, and then there was a particular concept, and there may be one, five, or ten variables belonging to a specific concept. So let's look at just the concepts. All right, this gives us a little more breathing room. Sex data, race data, place of birth data, nativity data — where were you born? Geographical mobility — what's your tenure in your place, have you moved a lot? Means of transportation to work. Let's see.
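In tidycensus, that lookup is load_variables(2020, "acs5"). Since the real call needs the network, here is the same browse-and-filter idea on a tiny mocked-up version of the table it returns (name/label/concept columns; the rows are abbreviated stand-ins):

```r
# load_variables() returns a table with columns name, label, concept.
# We mock a few rows so the filtering idea runs offline.
vars <- data.frame(
  name    = c("B19113_001", "B19113_001M", "B03002_012", "B08302_002"),
  label   = c("Estimate!!Median family income", "Margin of error",
              "Estimate!!Total!!Hispanic or Latino",
              "Estimate!!Total!!12:00 a.m. to 4:59 a.m."),
  concept = c("MEDIAN FAMILY INCOME", "MEDIAN FAMILY INCOME",
              "HISPANIC OR LATINO ORIGIN BY RACE",
              "TIME LEAVING HOME TO GO TO WORK"),
  stringsAsFactors = FALSE
)

# Deduplicated concepts give you "breathing room" to browse:
concepts <- unique(vars$concept)

# And you can search concepts by keyword, as we do for income here:
income_vars <- vars[grepl("INCOME", vars$concept), "name"]
```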
Living with grandparents, poverty status, coupled households, sex by marital status for those 15 and older, median age at first marriage, marriages in the past year, birth rates, college and graduate school enrollment, language spoken at home, disability status — all sorts of interesting things there. So lots of places to look. Let's say I found a variable I want to work with out of all of those: median family income in the past 12 months, specifically for families, in 2020 inflation-adjusted dollars. I would like to get that information for the five counties that make up New York City — the five boroughs. New York County is Manhattan; Richmond is — Brooklyn? No, that's Staten Island; Queens; Kings, which is Brooklyn; and the Bronx. If I got that wrong, New Yorkers, I apologize. But those are the five counties that make up the five boroughs. So what I'm going to do is use get_acs, which is from tidycensus, and say: hey, will you please give me, at the tract level, this variable, from the American Community Survey, for this state, and these counties — this list of five — and which survey? The ACS 5; I want the five-year. All right, so I'll run that, and that gives me median income. Let's take a look. We have census tract 1, Bronx County, and this is the variable. Oh — there's no estimate for income. Well, that goes back to that sparsity. I'm not sure where tract 1 is; it could be a water feature, it could be the airport — is there an airport in the Bronx? I don't know — but there's no estimate there. Then census tract 2 has an estimate of $69,000 with a margin of error of $18,000, and so forth. So you can see this data in a tabular way, which is great; that could be useful. And there are many more rows.
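The call looks roughly like this. B19113_001 is the ACS code for median family income (an assumption on my part about which code the workshop used); the block is guarded so it is safe to source even without tidycensus installed or an API key set:

```r
# Sketch of the tidycensus call: median family income at the tract level
# for the five boroughs. tidycensus::census_api_key() stores the key in
# the CENSUS_API_KEY environment variable, which the guard checks.
boroughs <- c("New York", "Kings", "Queens", "Bronx", "Richmond")

if (requireNamespace("tidycensus", quietly = TRUE) &&
    nzchar(Sys.getenv("CENSUS_API_KEY"))) {
  nyc_income <- tidycensus::get_acs(
    geography = "tract",
    variables = "B19113_001",
    state     = "NY",
    county    = boroughs,
    survey    = "acs5",
    year      = 2020
  )
  print(head(nyc_income))  # GEOID, NAME, variable, estimate, moe
}
```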
I'm just showing the first few rows; there are many more, because we're looking at all the census tracts in all five boroughs. But a couple of things. One: what if I misspelled one of those counties? Really easy to do. That's why I recommend using codes, and I have a link here that will take you to the codes so you can figure out what they are. So here's an example for Philly — Philadelphia, where I live. Again, the state is 42 and the county of Philadelphia is 101. I'm going to ask for the same variable, ACS 5, and I've added this new line: geometry = TRUE. That's going to actually give me the polygon itself. So let's look at that data. Similar to our data from New York, it has the name — census tract 14 — the variable, an estimate of $204,000 (I want to live in this census tract!) with a margin of error of $66,000, and then a geometry column that gives you a list of lat-longs you can connect to make that census tract. All right. So then we can play with this as we would with any data set. We can get summary statistics using summary and ask: what is the median of these median incomes for New York — $77,000 — what's the first quartile, what's the third quartile, etc. So that's the American Community Survey, the ACS. We can also get variables for the decennial census. Now, keep in mind: the American Community Survey asks a lot more questions, including really intrusive ones, of people who are willing to answer these surveys — a small subset of the population — so it uses sampling and inferential statistics. The decennial census is actually counting people. A couple of years ago, you probably got a postcard in the mail and filled it out, or you went to the website or somehow participated in the census — and if you did not, someone probably knocked on your door.
And there are a few summary files, and the decennial census takes a couple of years to come out. Part of this is for privacy reasons: because there is data on every single person — or at least that's the assertion, and that's what we want to be the case — there is a real risk, in small geographic areas, of things being identifiable. If there are only so many forty-something white women in my census block, then it's pretty clear who I am in that particular data set; maybe you can tell exactly how old I am, or find out some other details about me. And so, in order to make the data simultaneously as accurate as it can be and as private as it should be — trying to get that balance right — this is where differential privacy comes in, and it takes a long time for this data to become public. So although it is the second half of 2022, the 2020 census is still not completely available. Only the apportionment data — state-level data — has been released, I'm pretty sure. I won't go into the differences between Summary File 1, Summary File 2, and Summary File 3; there are a few links there where you can check that out. But since only the apportionment results are out for the 2020 census, we're going to use the 2010 census for our decennial variables. So what can we find out from the decennial census? If I look at the decennial variables, there are almost 9,000 of them — still many, many variables. But let's look for just the ones that include the text "Hispanic." We'll do that to see what concepts are available, and then we'll do a use case. If we scroll down a little, there are — it looks like three. Three are visible, but let me scroll. Oh yeah, there are many more than three.
So these are all things that are important or interesting that have the word "Hispanic" in the name. So let's say we have a use case. Here in the city of Philadelphia, we got a grant, let's say. There are established populations of folks from the Dominican Republic and Puerto Rico, and there's also an influx of new immigrants from Honduras, and we want to open a new clinic. But we'd like to figure out where in the city it makes sense to open it. So we want to look at the decennial census and figure out which census tracts have the highest population of Hispanic or Latino people, because we want to open a clinic that really addresses the particular needs of that community. So let's grab those variables. We're looking at Hispanic or Latino origin specifically, and there are three variables there, which we'll call 4001, 4002, and 4003 for short. So what we're going to do is get data on those variables — the ones that meet our criteria, that include the phrase "Hispanic or Latino origin" — at the tract level, for the state of Pennsylvania, the county of Philadelphia, using summary file SF1, and we want the geometry. Let me run that, and let's take a look at our data. So: census tract 1. Remember, there's variable 4001, 4002, and 4003. There's a value — it's not a percentage — then another value, and a third value. And if you add 4002 and 4003, they add up to 4001. So I'm going to jump ahead just a little bit. This shows what variables there are. I'm going to reshape that so my 4001, 4002, and 4003 form columns, instead of being in different rows. I am still using spread; I think the new term is pivot_wider.
So forgive me for being a little behind in the lingo. Either way, we're going to reshape — pivot wider, or spread — so that for each tract, I have the three variables as columns. I'm going to fly through here because I want to make sure we have time for Q&A at the end. This is just more presentation. So: total population, total not Hispanic or Latino, and total Hispanic or Latino. If we have these three values, we can figure out the percentage. I talk here about checking the math and checking my assumptions, and it's true that 4002 plus 4003 always equals 4001 — every single time. So now I can figure out the percent Latino for each census tract, and here I have 3.6%, 2.68%, 4.2%, etc. That's great and interesting, but here's the thing: this in a table is not the most interesting presentation. What I'd really like is to see it as a heat map — a choropleth — where these different census tracts, with their different intensities of this percent-Latino value, are compared against one another using color. That would be really useful. So let's walk through this. The first thing I'll do is look at what this object includes. This is from the Census Bureau; it did not come from a source where we brought it in through readOGR, so it is not a spatial data frame. Instead, it's an sf type. So we're going to use the sf package to convert it to the data type we're used to seeing, calling it "as spatial" using sf's as_Spatial. Now let's look at it again, and we'll see that it's a spatial data frame — oh, philly_latino_sdf, sorry. So let's do this: I'm going to add str(philly_latino_sdf) — SDF for spatial data frame.
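The reshape-then-compute step can be sketched offline in base R, with mocked counts, where reshape() plays the role of tidyr's spread()/pivot_wider(). Following the talk's shorthand: 4001 is total, 4002 not Hispanic or Latino, 4003 Hispanic or Latino (the P004xxx names are stand-ins for the real decennial codes):

```r
# Long-format mock of the decennial pull: one row per tract per variable.
long <- data.frame(
  GEOID    = rep(c("42101000100", "42101000200"), each = 3),
  variable = rep(c("P004001", "P004002", "P004003"), times = 2),
  value    = c(5000, 4820, 180,   3000, 2874, 126)
)

# Base-R equivalent of pivot_wider(names_from = variable, values_from = value):
wide <- reshape(long, idvar = "GEOID", timevar = "variable",
                direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# The sanity check from the talk: 4002 + 4003 should equal 4001 ...
stopifnot(all(wide$P004002 + wide$P004003 == wide$P004001))

# ... then compute percent Latino per tract:
wide$pct_latino <- 100 * wide$P004003 / wide$P004001
```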
And let's look at that. Now it's a SpatialPolygonsDataFrame. Oh — and it's continuing to print because I forgot to set max.level, so it's giving me every tiny little detail, which I don't want. That's better. All right, perfect. So what I'm going to do is create a palette. ColorBrewer lets you pick a palette, like greens or reds or purple-to-blue. You pick the number of bins, and it will split up that color palette according to the domain of values you want binned. In this case, my domain is the percent Latino; I want five bins; I want them all in the Greens palette; and if I don't have data, I want a medium gray — that's just a hex color, and you can use a hex color picker to pick one you like. So that's my palette. What did I not run earlier? I'm just going to run all my previous cells — sorry, this might take a second. While it's running, my previous run of this will still show. So what I'm going to do is take leaflet and say: leaflet, I am going to map this data, philly_latino_sdf — the SDF is for spatial data frame — and add some polygons. I want the fill of each polygon to be different: I want the fill to be a function (that's that tilde) of the Latino palette applied to this row's — this polygon's — percent Latino, so it will be one of five colors. Then the border thickness and opacity, the border color, and the fill opacity are all standard stuff that's easy to figure out. And I have this label, "percent Latino," where I'll pass in the actual value. Let's see, first of all, whether I got my palette working or not. "Library RColorBrewer: could not find function colorBin." I'm sorry — I'm not going to perseverate on this, because I don't know what I did wrong.
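For what it's worth, colorBin() is exported by the leaflet package — RColorBrewer only supplies the palette names — so loading RColorBrewer without leaflet attached is a likely source of that error. A guarded sketch of the palette setup (the 0–100 domain and the gray hex are illustrative choices):

```r
# Five equal-width bins over 0-100 percent; leaflet's colorBin() does this
# binning internally when you hand it breaks.
breaks <- seq(0, 100, length.out = 6)

# colorBin() lives in leaflet, not RColorBrewer. Guarded so this sketch
# sources cleanly even where leaflet isn't installed.
if (requireNamespace("leaflet", quietly = TRUE)) {
  latino_palette <- leaflet::colorBin(
    palette  = "Greens",   # a ColorBrewer palette name
    domain   = c(0, 100),  # percent-Latino values
    bins     = breaks,
    na.color = "#aaaaaa"   # medium gray for tracts with no data
  )
  # latino_palette(83) now returns one of five green hex colors.
}
```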
So for now, I'll just keep going and assume I got my palette working; I'll look at it after we end and figure out what the heck I did wrong. If you know what I did wrong, put it in chat. But let's assume we have a working color palette. What you'll see is something like this. This is, again, the county of Philadelphia, and if I hover over — I should have done some rounding here, which I did not — you can see the percent Latino as I hover: it shows the various percentages. There are a couple of gray NA tracts, because we don't know the value there, and then there's some very intensely dark green — percent Latino 81%, 83%, 84%. This is Northeast Philly, for those of you who are from here. And that is called a choropleth — it's sort of like a heat map. This is a JavaScript version that's really great for embedding online, in a website. But what if you wanted to make a flyer, or put this in a publication? What you can do — I'm going to use something called mapview — is essentially the same thing as before, but I'm going to add a legend, and a control, which is the title. And — whoops, I did not mean to zoom and pan — this is what that looks like. Then at the end you can actually save that as either a JPEG or a PNG. So we're almost at the end. I don't think we're going to get a chance to talk about our optional fifth portion, as I'd thought we might, but I want to stop here because I want to solicit questions. What questions do you still have? What was unclear? What applications can you foresee for this kind of work? Any questions or comments? Oh, good question.
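Putting the pieces together, the choropleth call looks roughly like this. philly_latino_sdf, percent_latino, and latino_palette are the workshop's names; the leaflet part is guarded so the sketch sources anywhere, and the static export via mapview::mapshot is left commented:

```r
# Hover-label formatting as a pure helper (with the rounding I skipped live):
fmt_label <- function(p) paste0("Percent Latino: ", round(p, 1), "%")

if (requireNamespace("leaflet", quietly = TRUE)) {
  # Polygons filled by the binned palette, with a hover label per tract.
  make_choropleth <- function(spatial_df, pal) {
    leaflet::leaflet(spatial_df) |>
      leaflet::addPolygons(
        fillColor   = ~pal(percent_latino),  # one of the five bin colors
        weight      = 1, color = "#444444",  # thin dark borders
        fillOpacity = 0.7,
        label       = ~fmt_label(percent_latino)
      )
  }
  # For a flyer/publication version (not run here):
  # m <- make_choropleth(philly_latino_sdf, latino_palette)
  # mapview::mapshot(m, file = "philly_latino.png")
}
```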
Adam asked: have you created choropleths using leaflet that are larger in scope — nationwide, multiple states — or is it generally used for a smaller geographic scope? It can certainly be used for larger things. However, leaflet is a JavaScript library, and that JavaScript is running on your local computer — if you're doing it in a web page, it's running in your browser; here, it's running within RStudio. So imagine, if you've ever done web programming and know what the DOM is — the Document Object Model — that for everything on a web page there's an object. So for each of these shapes, there's an object. Once you get to about 5,000 or 10,000 shapes, things start to get boggy and slow. So you can certainly use it to map, say, the United States state by state, or probably even county by county, but not tract by tract — that is too many objects. I would say I've used it more for smaller areas, because I'm really interested in hyper-local disparity in Philadelphia — all my passion projects are around local things — but you can certainly do larger areas. Is it possible to add text labels to these heat maps to highlight certain tracts or zip codes? Probably — I'm sure you can. But let's figure out how we would go about figuring that out. I'm going to change my share back to my Google Chrome. So let's search "leaflet add text layer." Leaflet works with layers, so that's going to be helpful as you Google. And leaflet is in a lot of different environments, so let me search "R leaflet add text layer." Stack Overflow is often good. All right — and there are a few things here where they're showing how to label things.
So I won't go into the weeds here — that was just a little pro tip on how to search for this. Other questions, comments, concerns? What I'm going to do now is go into my repo and add some issues for myself. I'm going to add a new issue — and you can do this too, to flag things to fix in the work files. "Change label for state Senate in New York City area." "Doesn't work in PubMed example." "Make dates agree" — this is where I had 2015 and 2020 in the census material. "What's up with colorBin?" All right, I'll submit that as a new issue, and I do this to show you how easy it is to add an issue. That way, if you try this and things don't work, I will see it — the whole world will see it. It's a great way to hold my feet to the fire, and it works with any public repository. I have had Hadley Wickham reply to things where I said, hey, it sure would be great if this thing did that. Issues are how you get things listened to. Oh — so it works for you, Jonathan? Then I don't know what I did wrong. Do I have any favorite long-form documentation or books for learning, Raymond asks? First of all, for leaflet: if you just Google "R leaflet," there are some really great samples similar to the work I did. Similarly for census. And there are people who do this better than me, for sure. I would say just search "mapping in R" and see what else is out there — "making maps with R," "drawing beautiful maps programmatically with R, sf and ggplot2" — that last one is a bit of a different stack than I've shown you. I don't have one favorite source, but there are lots of people doing this work. So don't take my word for it; I took you down a very narrow, specific trail, and there's lots and lots more you could do here. Can you dissolve tracts to show just counties with leaflet? Absolutely.
So, when I requested the data, I said I wanted it at the tract level; I could have said I wanted it at the county level, and that would have been just fine too. It really is about what data you start with. Instead of dissolving — saying these tracts belong to this county — I would just ask for the county-level data instead of the tract data to begin with. All right, let me add one more issue. I'll say: "Add resources" — since I mentioned things in chat and gave links — "please include in README." This is how the things I noticed today will maintain object permanence. Is there anything else I'm forgetting that I want to improve, so that the next time you do a git pull on this, you have everything that works? What if you were trying to combine tracts for an area that is not within the traditional hierarchy? Good question. Let's say you had your own algorithm and you said: these three tracts have characteristic A, so I want to melt them into one shape that is all of those tracts together; seven tracts over here have another characteristic, B, so I want those combined together. Maybe it's a watershed — which part of the watershed are they in — or these tracts are all within five miles of a Superfund site, or whatever. That's similar to using over(). I think sf is probably the best bet for that; that library is really good at combining shapes, and I'm pretty sure you can do it there. So let's just search "sf combine polygons." Yep: combine or union feature geometries; how to merge multiple sf polygons; using sf to combine polygons that share borders. That's how I would do it. If there's anything you would like me to think about, or any errors you discover, please do add them to issues or shoot me an email.
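One way that dissolve looks in sf, sketched as a function so it sources without any data. The "group" column is hypothetical — something you computed yourself, like watershed membership — and summarise() on an sf object unions each group's polygons into one shape:

```r
# Dissolve tract polygons into custom groups with sf + dplyr.
# tracts_sf: an sf data frame of tract polygons; group_col: the name of a
# column you added yourself (e.g., "watershed"). Defining the function
# doesn't require sf/dplyr to be installed; calling it does.
dissolve_by <- function(tracts_sf, group_col) {
  tracts_sf |>
    dplyr::group_by(.data[[group_col]]) |>
    dplyr::summarise(n_tracts = dplyr::n())  # geometries union per group
}

# Usage sketch (not run): dissolved <- dissolve_by(philly_tracts, "watershed")
```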
This is an area where I think we can do a lot of good: taking what could be narrowly interesting data in a table or a figure in a publication, and raising it to stakeholders, to donors, to foundations, to the public, in a way that is more intuitive — because maps are understandable at a gut level, without a lot of experience in statistics or data visualization or medicine or any of these specific skills. So I think this is a great skill for you to learn as an advocate and a healthcare professional. And finally: time is a non-renewable resource. You don't get back this time you spent with me, and I take that seriously. I'm grateful for your trust in me, and if there is a way I could have spent your time more valuably, I want to know. I'm not 100% sure what the mechanisms are for speaker feedback, but to be very clear, I relish and want to hear feedback, including hard feedback. I want to know what I did wrong, so I don't waste somebody else's time — if you feel like I wasted your time, or I got something wrong, or I stumbled over something, I want to know. So please give me the gift of feedback; I'm very grateful. Thank you for spending your time with me. There are some more questions in chat, so I'll stick around and answer those for the next four minutes. But if you're done, I understand — have a great rest of your day. All right, let's see here. Oh — thank you so much for the praise; I also welcome praise. Hard feedback is useful; praise is also wonderful. Thanks for the great point: "I like to use polylines with thicker county boundaries. Does the over function allow you to identify which tracts are within the county boundaries?" You can, but importantly, the GEOID has that information embedded in it. So if you have a GEOID that's, in my case, 42101...
That 101 tells you the county — you don't have to figure it out geometrically; the GEOID has that information encoded in it. So doing it spatially would probably be a waste of time, to be quite honest. I hope that's clear. Great. I'm looking to see if there are any other questions... I don't think so. Thank you, all of you — thank you to those of you who raised questions and asked for clarification. I always pack way too much into any given time, so thanks for coming along as I ripped through a lot of content. All right, thanks so much, everybody. Have a great rest of your conference; I will probably see you in some of these rooms. Feel free to hit me up: seek me out on LinkedIn, tweet at me — maybe I'll get less bad at Twitter, probably not — or shoot me an email. Thanks again for your time, and have a good rest of your day. And Rachel, I will stick around to see if you have any instructions for me at the close. Thank you so much. Great. Thank you so much, Rachel. Have a good one. You too. Thanks, Joy. Bye-bye.