Yeah, thanks. Hi, everybody. My name is Boya Kang, and I work at CDC headquarters in Atlanta, Georgia. I'm pretty excited to be here to talk to you today, mostly because it's been a while since I've engaged with a non-government audience, much less a community of data scientists.

Full disclosure: I'm not actually going to talk about R per se. What I'm hoping to do instead is present the broader context within which some of these tools and capabilities, and more importantly the philosophy behind them, could probably be better integrated, by walking through a few case studies.

I have a somewhat complicated background, but on paper I am an infectious disease modeler. I spent most of my time in a lab that specialized in high-performance computing, running large-scale agent-based simulations to model infectious disease epidemics and test various intervention scenarios, all for the purpose of informing public health preparedness and response. That led me to join CDC as a postdoc a few years ago, and since then I've come on as a permanent scientist within the Division of Preparedness and Emerging Infections. Right now I'm part of a team called the Health Economics and Modeling Unit, and on the side I also lead the COVID-19 school data team.

In this talk I'm going to cover three stories, or case studies, to give you a small glimpse into what it looks like inside a federal response. I say this because, even with the training I've had in computing and data science, the reality is a bit of a culture shock, especially in a response situation. I'll cover three examples: one from the West Africa Ebola outbreak, and a couple from the COVID-19 pandemic.

Just to give you a little bit of background on our team and kind
of what we do here: we have a day job as modelers and typical public health scientists, but during the activation of a national emergency, our team helps staff the Emergency Operations Center at CDC and provides support through the modeling unit.

Our mission as a team is to help inform real, actionable policy questions of immediate public health concern. This is real-time decision-making, typically and most frequently in the absence of any data, or at least good data. That's often at odds with how we've been trained: cleaning data and trying to get it as close to perfect as we can. But the reality is that our audience, our client, the people I work for, are typically decision-makers and people in leadership: people who have to make rapid decisions, especially as an emergency unfolds in its beginning phases. We need to provide them with answers, often a number or a few numbers, and they want it preferably yesterday.

Because of this, we necessarily take a simple approach, both in our methods and in the way we communicate our results to people in leadership, who aren't data scientists or data analysts. This isn't to be confused with a simplistic approach; there's a distinct difference. On top of that, decision-makers also want a range of options and what-if scenarios, because ultimately what we want is to equip a decision-maker, a person in leadership or policy, to compare data-driven answers or modeling estimates against their own intuition or subjective decision-making process. At the end of the day, all of these decisions are inherently subjective, and that's going
to happen whether data is absent or present, good or bad.

So let me first take you back to the 2014-2016 Ebola outbreak in West Africa. This was before I joined CDC, but I like to tell this story because it's such a brilliant demonstration of the kind of work our team does and why we do it the way we do. We typically take a less-than-academic, or non-academic, approach, for the reasons I described: we have to reach a decision whether or not the data is good enough, so we have to get the data to a good-enough state while still having an impact.

The man you see on the screen is Martin Meltzer. He's actually my boss, and I love this article about him; I totally recommend you read it. This story captures the whole reason I wanted to work at CDC: basically, to be like this guy when I grew up, and luckily I work for him now, so it all panned out.

If you remember the West Africa outbreak, it was a very unprecedented Ebola dynamic, where case counts just kept increasing roughly exponentially for quite some time. Our team was in touch with and working for some of the highest levels of leadership in government, including the National Security Council. Martin and our team provided a set of modeling estimates that described what would happen in the near or longer-term future given a set of constraints or a set of interventions.

What ended up happening was that this 1.4 million number was what the public and the news media outlets really clung to; they identified it as the most important finding of the modeling effort. The problem is that, to a modeler, this 1.4 million number was really meant to indicate the worst-case scenario, which is
the counterfactual: the full extent of an Ebola outbreak if we did absolutely nothing. Some of the criticism Martin and our team received, by the time the outbreak was winding down, was that the number was totally unrealistic, that we were way off. But the interesting argument here relates to the public health paradox that no news is good news. With a lot of these interventions we're anticipating fewer cases and fewer deaths, so you don't actually end up seeing a noticeable effect or impact.

What this identified for us was that the way we use data to communicate to the decision-maker matters, but it is just as important to figure out how to convey the information in a way the general public can understand. Public health, quite frankly, captures everybody, and very few people in that group understand the nuances and limitations of modeling.

Things have changed quite significantly since then. With COVID, a lot of things happened that had never happened before in a federal response. I recently stumbled upon a tweet by Dr.
Tom Frieden, who was the CDC director during both Obama administrations. It was a nice validation, at least for me, and I'm not just fangirling my own boss here. Models have been very difficult to explain, not only to people outside our realm, like decision-makers, but also to citizen scientists and the general public, who really should be able to use this data and understand what is going on around them in their communities.

So I'm going to switch gears a little and talk about some COVID examples, things I've worked on previously or most recently. One of my first deployments to the response, when I was a second-year postdoc, so not that long ago, was to a federal interagency team called the Data Strategy and Execution Workgroup, which is still around today. This team was led out of the White House COVID response, and CDC was just one member of an interagency group that included folks from FEMA, ASPR, DoD, FDA, a lot of government agencies, all to formulate a more coordinated response effort.

Under that structure, I led a team called Areas of Concern. Our task was pretty simple: leadership needed to know where the hot spots, or potential areas of concern, were emerging, and they needed that list prioritized at a regular cadence, so that folks from CDC and other agencies could conduct outreach and ultimately deploy assistance in any way we could. That assistance could take the form of contact tracers. Did they need more testing resources? Did they need
more nurses and doctors? Things like that.

This was an effort I led for about a year and a half, and what we experienced was definitely a kind of deja vu: every time a new wave passed, a new therapeutic emerged, or new vaccines became available, the question was, where is the next hot spot going to be? Can we predict them any earlier? Can we predict them better?

The key question that came out of this, which we had to keep reinforcing to our client, leadership, the decision-maker, was: what is a hot spot? How do you define the criteria? What are the most important metrics? Surprisingly, this is a very difficult question for anyone to answer, especially for people in leadership. One way we helped guide the requester was to ask: what exactly do you intend to do with this hot spot list? What kind of federal assistance is available? That way you can actually come up with a target, with very specific objectives and outcomes that you're trying to measure.

And after all that, a question we're still grappling with now: how do you know if what you're doing is working? How do you know when to stop doing what you're doing? This is something that emerged over time as the pandemic shifted and waxed and waned. Early in the epidemic, when we had that initial huge surge of cases over the winter months, the whole country was a hot spot; based on our criteria and the methods we were using, the whole country was red. That pointed us to further examine the sweet spots along the epidemic curve where these kinds of efforts are more effective than at other times.

This is a lot of stuff on the screen,
I know, but I wanted to show you the early iterations of this hot spot, or areas of concern, identification. Early in the pandemic there was a lot of activity around collecting data and trying to make these predictions so that we could be preventive rather than reactive, and we were trying to figure out how to produce a very short slide summarizing the key data elements, the most important factors in deciding whether we needed to engage with a jurisdiction, a state, or a county.

I devised this generic set of metrics, shown at the top, because leadership really likes numbers and colors, things they can consume very rapidly. We also put some of these time series on the side, because they help answer a lot of the questions leadership asks.

One question that came up repeatedly was about test positivity, because at one point it was a pretty good indicator of transmission risk in the immediate future. The argument goes: well, have they been scaling up their testing? Has the volume of tests increased? Is that the reason we have more cases and test positivity is high? To which you can point at this time series, where multiple lines are overlaid: the black is test volume, and the red is test positivity. By seeing that test volume has remained fairly stable while percent positivity is increasing, we can help leadership understand that, no, testing volume is not what's actually driving up test positivity.
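The check described here, whether rising positivity can be explained away by testing volume, can be sketched roughly like this. All numbers and thresholds below are made up for illustration; they are not CDC's actual figures or code:

```python
# Illustrative sketch: can rising test positivity be explained by a change
# in testing volume? Weekly totals for a hypothetical county; all numbers
# are invented for illustration.
tests     = [10_000, 10_200, 9_900, 10_100, 10_050, 9_950]  # tests performed
positives = [300,    350,    420,   510,    640,    800]    # positive results

positivity = [p / t for p, t in zip(positives, tests)]

# Call volume "stable" if its relative spread stays within a 10% band.
vol_spread = (max(tests) - min(tests)) / min(tests)
volume_stable = vol_spread < 0.10

# Call positivity "rising" if the latest week is well above the first.
positivity_rising = positivity[-1] > 1.5 * positivity[0]

if volume_stable and positivity_rising:
    print("Rising positivity is NOT explained by testing volume; "
          "transmission is likely increasing.")
```

The same comparison is what the overlaid black (volume) and red (positivity) lines on the slide let leadership see at a glance, without any arithmetic.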
So we iterated on this dashboard slide, or the "money slide," as people like to call it, because it's got everything in one view. Over time, as vaccines came out and therapeutics emerged, we developed these at various levels of granularity: this one is at the county level, and this is an example at the state level. All of these metrics are publicly available; you can get them from the COVID Data Tracker. I've included a general explanation of what each metric actually displays. We also aggregated up to, say, the CBSA or FEMA region level, depending on who our audience was and what the actual intent was afterwards.

Now I'm going to fast forward from the early days of areas of concern and hot spot detection to what we have now. I took these screenshots from the COVID-19 Community Profile Report, which CDC publishes on healthdata.gov, bi-weekly right now. It's almost a treasure trove of data, available all the way down to the county level, a very long report with lots of graphs and visualizations. But I put these three maps here because they are the three main high-level maps we provide, and they're a good way to communicate that these are three distinct ways to portray the data, each answering a seemingly similar question, only to realize that you actually need specific metrics to answer specific questions, not just a generic hot spot definition.

You'll see it right now as the COVID-19 Community Level, which I think was recently updated to inform some of the guidance. This really just tells you what is happening at this very moment. What is the current status? Are we out of the woods yet?
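As a rough illustration of the "specific metrics for specific questions" point, a Community Level-style map boils several metrics down to one category per county. The sketch below uses thresholds loosely patterned on CDC's published cutoffs, but simplified; treat them as assumptions, not the official definition:

```python
# Illustrative sketch of a Community Level-style classification.
# Thresholds are loosely modeled on CDC's published cutoffs but simplified;
# this is NOT the official definition.
def community_level(cases_per_100k: float,
                    admissions_per_100k: float,
                    pct_beds_covid: float) -> str:
    """Classify a county as low/medium/high from three current-status metrics."""
    if cases_per_100k < 200:
        if admissions_per_100k >= 20 or pct_beds_covid >= 15:
            return "high"
        if admissions_per_100k >= 10 or pct_beds_covid >= 10:
            return "medium"
        return "low"
    # A higher case burden shifts every hospital threshold down a tier.
    if admissions_per_100k >= 10 or pct_beds_covid >= 10:
        return "high"
    return "medium"

print(community_level(80, 4.0, 3.0))    # low
print(community_level(80, 12.0, 3.0))   # medium
print(community_level(250, 12.0, 3.0))  # high
```

The point of collapsing to three colors is exactly the "numbers and colors" consumability mentioned earlier: leadership reads the category, not the underlying metrics.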
The second map shows more nuanced definitions of something we developed called the areas of concern continuum, where all counties are categorized into a certain phase or category, a quote-unquote "life stage" of an epidemic curve. The cool thing is that we also implemented something called an emerging hot spot model, a machine learning model that can predict a potential hot spot within the next one to seven days with a pretty good degree of accuracy, about 85-90 percent last time I checked.

But the first map we ever developed, the first algorithm we devised, was the rapid riser counties metric, which only takes in case data. The purpose was to show the counties experiencing a high degree of sustained acceleration of cases. The idea is that if you have a rapid acceleration of cases, the secondary transmission that follows is likely to continue, so it's probably a pretty good indicator for identifying an emerging area of concern.

Okay, I know I've been going for a long time, so I apologize for moving really quickly. Right now I lead the school data team, and this is maybe more of a success story, or at least a story of how we overcame some of the challenges we faced. The purpose of our school data team was to collect, analyze, and disseminate any school data related to COVID-19. This was a collaboration with the White House and the U.S.
Department of Education, who asked us to maintain situational awareness in the form of a national overview of K-12 public schools. This was both for real-time situational awareness and for tracking the data over time for retrospective analysis, so that we can inform future guidance.

Some of our broad objectives were: tracking school closures in real time, knowing which school districts were closed for COVID-related or pandemic-related reasons, whether because people were sick, teachers went on strike, or there was a shortage of bus drivers; and knowing whether kids were learning in school, remotely, or in a hybrid format. We also keep track of the mitigation strategies and policies that schools and school districts implement over time, because there really isn't a good federal database for any school data, to be honest. Ultimately, we're hoping to evaluate the secondary impacts on school populations.

For those of you who might have worked with school data before: it's pretty challenging, and it's actually very surprising to get a sense of what this data currently looks like. There are lots of legislative and legal reasons why there isn't a federal standard for a lot of school data, but the data we do have is typically collected at a yearly cadence, so it lags annually, and that obviously isn't going to help real-time response needs. One of our challenges was to coalesce, synthesize, and standardize a set of nationally representative data so that we could provide a general sense of awareness as to how many schools in the U.S.
are closed due to COVID, how many of them are masking, which of the more feasible strategies are being implemented more frequently than others, and so on and so forth.

One specific challenge we faced most recently was dealing with the geographic boundaries of school districts. To make a long story short: school districts are typically linked to the district office address, which is generally located in a city or county center that is a little more affluent than some of the neighboring areas. Within a school district there may be dozens of schools, encompassing a very diverse group of communities with varying levels of income and social status. So to understand the whole population of a school, you need to delineate boundaries that capture more than just the school district address itself.

One of the things we were able to publish online and make public is this learning modalities dashboard; you can find it on HHS Protect Public. This was one of the successes we've had in dealing with the challenges of school data, specifically real-time collection, but also the fact that a lot of the collected data is very patchy. We had to take in a lot of aggregate and disparate data sources and figure out a way to coalesce them. What we ended up doing was putting the raw data we collect into a hidden Markov model that can infer, with some degree of certainty, that a given school is going to be in a given modality. This is updated now; you can access all the data, see it, and actually click on it and interact with it a little. This was one novel solution that emerged out
of the demand and need for a better way to understand school data.

All right, so, really quickly, the bottom lines. We live, at least for me, in an environment where we have to assume there's very little data, but we still have to work with it and figure out the minimum data, the short list of the most important metrics necessary to make an informed decision. There have been a lot of challenges and lessons learned through these experiences: how do we understand geographic resolution? How do boundaries impact the way we interpret the data? All in the larger context of an ongoing epidemic, where variants are emerging, things are going to change, and human behavior is going to vary at different times of the year. All of those are inherent limitations that have to be communicated very clearly.

In closing, the one thing I do want to mention about decision-making, and a lot of this comes from recent criticisms of CDC data and of the federal response in collecting data, is that, at least for our purposes, more does not always mean better. From the decision-maker's perspective, they're more interested in orders of magnitude than in enhanced precision. Typically, when we're at odds with other working groups or academics who demand, or would feel more comfortable with, a larger or more refined data set, what we realize is that even if we were able to integrate that data, it probably wouldn't change the ultimate decision, because the question is: do we act now or not? Do we do something now, or do we do it later?

And that is my time, and that's all I've got. Thanks, everybody.

Thank you so much.
There are some questions in the chat for you, Gloria. One of the questions is: are all of these COVID-19 Community Profile Reports generated using R?

No, not all of them. It's definitely a mixed bag. We have a lot of folks on different teams working to put together the Community Profile Report. There's some element of R; I think a lot of CDC analysts use R, and we have contractors who mostly use a lot of Python. But our environment and our data live in Palantir Foundry; that's the data infrastructure we use to manage all of our federal data. Luckily, Foundry does allow for the ability to use different languages, and I think that's been helpful for a lot of our younger analysts, at least, to engage with the data.

Another question asks if you could provide the links to the publicly available dashboards you mentioned in your talk. We'll do that. And another question: are you using wastewater for tracking across the country?

Yes, that's actually one of our elements of modernizing our data and coming up with more novel solutions. It's a more recent effort; for the majority of the COVID response that effort was ramping up to cover more and more jurisdictions, and I think we now have a good handful, or at least a majority, of states able to contribute wastewater surveillance data. We're using that as a sentinel surveillance system rather than bothering states and folks to give us their data. So yes, I think that's going to be part of some release at some point, but it is available on the COVID Data Tracker, I believe.
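To make the rapid riser idea from the talk concrete: a county can be flagged when its daily case counts show a sustained rise over a recent window. The window length and the 80 percent threshold below are illustrative assumptions, not the actual CDC criteria:

```python
# Illustrative "rapid riser" check: flag a county whose daily new cases
# show a sustained rise. The window and threshold are made up for
# illustration; they are NOT CDC's actual rapid riser criteria.
def is_rapid_riser(daily_cases: list[int], window: int = 7) -> bool:
    """True if cases rose day-over-day on most of the trailing `window` days."""
    if len(daily_cases) < window + 1:
        return False
    recent = daily_cases[-(window + 1):]
    increases = sum(1 for a, b in zip(recent, recent[1:]) if b > a)
    return increases >= 0.8 * window  # rising on >= 80% of recent days

flat    = [50, 52, 49, 51, 50, 48, 52, 50]
surging = [50, 58, 67, 80, 95, 115, 140, 170]
print(is_rapid_riser(flat))     # False
print(is_rapid_riser(surging))  # True
```

Because it consumes only case counts, a criterion like this can run for every county from a single data stream, which matches the talk's point that the rapid riser metric "only takes in case data."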