Okay, welcome everybody to our very first keynote of CSVConf 2021. I'm really, really excited to have the incredible team behind the COVID Tracking Project. For those of you who are not familiar with the effort, it is a volunteer effort that came together in the early days of the pandemic, when data was really, really hard to come by and there was a lot of missing information everywhere. A group of technologists and volunteers came together to fill in the gaps and collect and publish data to help states and other agencies understand what was happening with the outbreak. The project has done an incredible amount of work over the past year, and they have just about wound down at this point. So this is a really great time to hear from them as the project wraps up and to learn about all the challenges that they've overcome in the past year. I'll do a quick round of introductions. We have Michal Mart, who is one of the data quality leads for the COVID Tracking Project. We have Kara Schechtman, who is another data quality lead as well, and she will be starting a master's at Stanford very soon. We've got Julia Kodysh, who is a data infrastructure lead for the project. And last but not least, Kevin Miller, who is the website lead for the project. So thank you all for speaking here today, and I will let you take over. Thank you for the introduction, and hello all, nice to meet you. So my name is Kara, and I, along with Michal, Kevin, and Julia, will be taking you along our journey of building the COVID Tracking Project, all the way up from a single spreadsheet to the critical data infrastructure it became. Just a quick note before we get started: we're going to be touching on a lot of different things during this presentation, and we'll be moving relatively quickly through them, just to touch on as many different corners of the project as we can. So if you'd like to learn more about anything that we talk about today, we've compiled this resource sheet that you can access at this URL. It has links to articles that we've written about some of the things that we're going to be talking about, and documentation around our website. So you can check that out if you're interested in learning more about something that we talked about today. So, just to provide a little bit of general background on what the COVID Tracking Project is: we're a volunteer organization that collected COVID-19 data from the websites of all US states, territories, and the District of Columbia. We started doing that on March 7th of 2020, because it was really the only way to get the data back then. The federal government first put out a COVID data tracker in May of 2020, and when that data came out, we did an analysis comparing it to our own data and found that there were some discrepancies. So we continued to track data coming directly from state and territorial sources. We recently wrapped up data collection on March 7th of 2021, exactly a year after we started, because we felt the federal data had improved enough over the course of the year and that this work is properly the role of the federal government. And we're currently in sort of an archiving phase. We're doing that until the end of the month, just cleaning all the data, making sure everything's tidy, and doing some wrap-up analysis work. So the COVID Tracking Project publishes three data sets.
The first is our testing and outcomes data set, which tracks tests, cases, hospitalizations, and deaths across all 56 jurisdictions in the United States. You may be familiar with this from the four charts that we produced each day and posted on Twitter: tests, cases, hospitalizations, and deaths. We also have the COVID Racial Data Tracker, which tracks cases, tests, hospitalizations, and deaths broken down by race and ethnicity, to understand racial disparities in the pandemic. And our long-term care facilities data set, which tracks cases and deaths in nursing homes, assisted living facilities, and other long-term care facilities, which were disproportionately impacted by COVID. We're mostly going to be talking today about the testing and outcomes data set, just because it's what the four of us here have worked with the most. But whenever we talk about a difficult dashboard or a difficult data set to work with, just multiply by 20, and you'll have a sense of the raw data that CRDT and LTC managed to make into really amazing data sets. So the Tracking Project really started because two staff writers at the Atlantic, Alexis Madrigal and Robinson Meyer, wanted to know basic information early on in the pandemic, like how many tests were states running. And there really wasn't any data set at the federal level, or any clean aggregated data set that they could find, so they started just thinking, well, this is simple, we'll call every state and ask them. Pretty quickly they found that a friend of Alexis's, Jeff Hammerbacher, had been doing a similar thing. So they joined forces, along with another co-founder, Erin Kissane, to start a little project. This was early March 2020, and they were thinking that maybe in a few weeks the federal government would be doing this and they wouldn't have to anymore. In the beginning our website was basically a big button that took you to a Google Sheet, but quickly they found that the sheet was crashing from just the sheer number of people who were trying to access and use it. Over time, we became one of the most critical pieces of information about the pandemic for the media, so our charts were oftentimes just copied and pasted into graphics on television and in print media. So within a very short period of time, we went from a Google Sheet managed by a handful of people to a large piece of data infrastructure about COVID in the US. Just a few examples of how we were used: we were used by two presidential administrations; the Biden administration used our data during their transition planning for COVID. We were cited in dozens of congressional reports and at least 140,000 times in news articles. We have a lot of mentions and academic citations, and Elizabeth Warren once said that she must be 10% of all of our website traffic. Hi, I'm Michal. So we wanted to describe to you what it was like to build this emergency response organization, and to encapsulate what it felt like, we often say that we felt like we were building the plane as we were flying it. So here is the itinerary for that flight. We'll talk about the data entry process, which was manual, and why we did it manually. We'll talk about how we stored the data, our database, and the choices about what data points we stored in it. We'll talk about where the data comes from, because we're just an aggregator, and how we created an archive of our data sources.
We're going to explain how we presented the data and made it clear and accessible to all. And finally, we will talk about the community that came together to do all these things. So how did we get the data? I talked to a non-CTP friend a few weeks ago, and he asked me, so what did you actually do at the COVID Tracking Project? Were you, like, calling hospitals and asking them how many patients they had? So, no. What we did do is look at state dashboards. We would meet every day on the Slack channel, like five or ten of us, and we would each go to state dashboards, get the numbers, and enter them into one gigantic spreadsheet. Every data point that was entered was then double-checked by another human, and interesting questions were always popping up, and we would hash them out in the dedicated per-state thread until we had a daily data set that was ready to be published. So, huge disclaimer here: there's going to be some frustration with state reporting. We all think that the state and local health departments are the real data heroes of the pandemic. They had to stand up systems to track and coordinate the response, as well as report the data out, and overall they did an incredible job. But as the consumers of the data they published, we do reserve the right to vent a little bit. So, this data was not easy to get. Early on there were far fewer dashboards, and we relied a lot on press conferences. We had an amazing team of reporters and outreach people, led by Kara Oehler and Artis Curiskis, who would watch the press conferences and summarize the relevant numbers for our data entry team. And this was also an amazing two-way system, where we could submit questions to the states if anything needed further clarification. Another example of how difficult it was to get the data is the frequent use of hovers. The image that you're about to see shows how hard it was to find a very narrow strip and hover over it in order to get the daily number. And this caused us to react by creating an emoji to express our feelings about hovering. There were multiple other issues, like rounding and percentages, that really made it very challenging to get this data. And all these challenges prompted us to become very efficient in our data entry process. We came up with a great system of source notes. It was created by JD Moresco and Elliott Klug, and then along came Brandon Park, who gave it a huge glow-up by creating a formal syntax. Later on this was iterated on and brilliantly maintained by Hannah Hoffman. This system consisted of a list of links and associated steps on where to get each data point. When you hovered over a data point (and yes, sometimes hovers can be good), you would see very detailed instructions on where to get the data, and you could easily click through and follow them. So this system gave us the ability to sail, or glide, much more smoothly through a data entry shift.
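To give a sense of what a source note boils down to, here is a minimal sketch in Python. The real system lived inside the data entry spreadsheet with its own formal syntax, which isn't reproduced here; the state, metric, URL, and steps below are all made-up examples.

```python
# Illustrative sketch only: the real source notes lived as structured
# annotations inside the data-entry spreadsheet, with a formal syntax we
# are not reproducing here. The metric names, URL, and steps are made up.
SOURCE_NOTES = {
    ("RI", "positiveCasesViral"): {
        "url": "https://example.ri.gov/covid-dashboard",  # hypothetical URL
        "steps": [
            "Open the 'Testing' tab (third tab from the left).",
            "Hover over the last bar of the daily chart.",
            "Record 'Cumulative positive tests' exactly as shown.",
        ],
    },
}

def instructions_for(state: str, metric: str) -> str:
    """Render the data-entry steps a shift volunteer would follow."""
    note = SOURCE_NOTES[(state, metric)]
    steps = "\n".join(f"  {i}. {s}" for i, s in enumerate(note["steps"], 1))
    return f"{state} / {metric}\nSource: {note['url']}\n{steps}"

print(instructions_for("RI", "positiveCasesViral"))
```

The point of the system was that a shift volunteer never had to rediscover where a number lived; the note told them exactly where to click.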
Hi, I'm Julia. I'll start here by trying to answer what is probably kind of an obvious question at this juncture. We've talked a lot about all of our manual processes for collecting this data, so why didn't we just scrape it all? It seems like this project is something that should lend itself to that. But unfortunately, the data landscape that CTP operated in had an overall lack of consistency or standards in COVID data reporting, especially early on, and everything changed really often. So there were changing dashboards, and to begin with, many of these dashboards seemed like they were actually designed to obfuscate the data, or at least be hard to scrape. Then, generally, there were lots of changes to the data formats over time, and changes to the metrics themselves that were being reported. And then, what's possibly the hardest thing to catch automatically, there could be a change to the meaning of a reported metric partway through. So even if we'd set up a perfectly automated system on day one, it's wrong by day five, in ways that are hard or impossible to notice with automated checks, especially early on. So you still come back to human scrutiny being a core part of our process; ultimately we needed human eyes on the data, and context and annotations, at least somewhere in the data pipeline. So this is just one example, and it's actually one of the better ones. This is Rhode Island's COVID data from July and September of last year. Over the space of two months, they changed the dashboard layout, like you can see here. So, disclaimer: Rhode Island is actually great; they provide a great sheet with all the data available. But in many other cases that looked like this, this kind of formatting change might also come with a change in the data export, if a state website even supported a data export. And in cases where they didn't, we did a lot of begging of state health departments for CSV files, which sometimes worked. Overall, as the world of COVID data reporting started reaching a little bit more of a steady state, we were eventually able to take more advantage of automation, but finding that sweet spot was something that could only really be done with the experience of starting out with a pretty manual process at first. And even then, once we did set up some data automation, it still needed a lot of manual care. So no matter the bots, you still need the people.
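To make that division of labor concrete, here is a minimal sketch, not CTP's actual tooling, of the kind of automated sanity check that can catch a vanished column or a suspicious jump, while a silent change in what a metric means still has to be caught by a person.

```python
# Minimal sketch, not CTP's actual tooling: the kind of automated sanity
# check that can catch schema changes and weird jumps, but not a silent
# change in what a metric *means* -- that still takes a human.
from typing import Dict, List

def review_flags(yesterday: Dict[str, int], today: Dict[str, int]) -> List[str]:
    flags = []
    # A vanished metric often means the dashboard or export changed shape.
    for metric in yesterday:
        if metric not in today:
            flags.append(f"{metric}: missing today (layout/export change?)")
    for metric, value in today.items():
        prev = yesterday.get(metric)
        if prev is None:
            flags.append(f"{metric}: new metric appeared")
        elif value < prev:
            flags.append(f"{metric}: cumulative count went DOWN ({prev} -> {value})")
        elif prev > 0 and value > prev * 1.5:
            flags.append(f"{metric}: jumped {prev} -> {value} (data dump? definition change?)")
    return flags

# Hypothetical numbers for one state:
print(review_flags({"totalTests": 1000, "cases": 80},
                   {"totalTests": 2600, "cases": 79}))
```

Checks like these can only flag rows for review; deciding whether a jump is a fax data dump, a definition change, or a real surge takes exactly the human scrutiny described above.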
Okay, so now we have all this data that we collected very carefully with our eyes and our hands over time. Where does it go? So this part of the talk is going to involve a lot of our feelings about Google Sheets. Just a quick recap of our data entry process: during data entry shifts, we had a lot of volunteers who would scour state dashboards, and then everyone wrote their findings into a giant Google Sheet, which we call Worksheet 2. No one talks about Worksheet 1. There's also lots and lots of metadata attached to every state, and everything is ultimately stored in interconnected sheets. So, Google Sheets is wonderful in many, many ways. It's a fantastic tool for collaborative data entry, especially for data structured as a table. It's generic and flexible enough to accommodate lots of folks with different backgrounds and to respond to different circumstances, like different types of data changes, which came up a lot. Lastly, we probably don't need to convince anyone listening here, but the CSV format is very, very important to general data users, and so it's baked into our data entry processes and pipelines pretty much every step of the way. So all of that meant that for a long time Google Sheets was our world for anything we needed. We had really awesome tooling within Sheets to make all these things possible. It was a mix of Google Apps Script magic and lots of Excel-type power usage, and all this powered very domain-specific collection-to-publish processes. So using Sheets for data entry was awesome. The problem was that in the beginning we were also using it as a data store for our time series data, which was less awesome, because we started running into a bunch of problems. For starters, our data was too big. We were adding 56 rows and an increasing number of columns at least once a day, and since the project ran much longer than the few weeks we originally thought we would need to be doing all this for, it added up pretty quickly. So Sheets performance started to degrade with this much data and people editing all at once during a shift. We ran into a bunch of other issues, like accidentally losing edit history if cells got deleted, and rolling back mistakes was also pretty hard, because all of our tooling and dependencies were coded within the space of Sheets. So it was an amazing but somewhat brittle machine, and problems like that were also harder to prevent within Sheets, because as our data got more complex, so did validating it. So along with Zach Lipton, co-leading the data infrastructure team, we added a database layer that took some of the data store pressure off of Google Sheets. The main challenge here was adding a critical piece of infrastructure in the middle of a production data pipeline, which essentially amounted to building a second plane while flying it and connecting it to the first plane mid-flight, very carefully, with no interruptions. Those kinds of changes are never easy, but it was worth it for the separation of concerns between data entry and data storage. Just a brief overview of our data stack. We set up a Postgres database on Amazon RDS, and then the data in that database became a published source of truth. Then we also had an internal API layer, which was the only programmatic route to the database. This was a Flask app that sat between the database and our spreadsheets. It took care of transforming and validating incoming data, and would also transform the data to feed it back out to our website and public API. This is getting a little bit in the weeds, but just a few words about the way that we represented our data, because we think it's pretty cool. In order to preserve history, we set it up so that data was never modified, only added. And the way the data gets added is through batches, which we can interpret chronologically and play forward as needed. This borrows a bit from the idea of database transaction logging, except applied at the data modeling level. So for example, if we had to go back and revise our data in response to a state data revision, we would know that the newer batch is the data that we should treat as currently accurate. But since we stored all of it, we were able to trace these kinds of changes over time. And then we left most of the semantics-heavy work to the internal API layer around the database, which meant that we could keep our internal data model pretty simple. In general, keeping the data store as simple as possible was our friend in this work, and having a working model that was a combination of a simple data representation and a layer that would transform it as necessary for all the use cases we had worked pretty well. And as a more general point, all of our tech ended up being, on one hand, generic solutions; on the other hand, pretty custom and scrappy; and at the same time, we still had things like unit tests and code reviews. So from the engineering standpoint, it was a cool balance to strike.
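Here is a toy sketch of that append-only batch idea, using SQLite for brevity. This is not CTP's actual schema, just the general shape: revisions arrive as new batches, the current view plays batches forward, and nothing is ever overwritten.

```python
# Toy sketch of the append-only batch idea described above -- not CTP's
# actual schema. Data is never updated in place; revisions arrive as new
# batches, and the "current" value is whatever the latest batch says.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE batches (batch_id INTEGER PRIMARY KEY, published_at TEXT);
CREATE TABLE observations (
    batch_id INTEGER REFERENCES batches(batch_id),
    state TEXT, date TEXT, metric TEXT, value INTEGER
);
""")

def publish_batch(published_at, rows):
    cur = db.execute("INSERT INTO batches (published_at) VALUES (?)", (published_at,))
    batch_id = cur.lastrowid
    db.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
                   [(batch_id, *row) for row in rows])

# Day one: a state reports 100 cases; later, the state revises it to 97.
publish_batch("2020-07-01", [("GA", "2020-07-01", "cases", 100)])
publish_batch("2020-07-03", [("GA", "2020-07-01", "cases", 97)])

# Current view: play batches forward and keep the latest value.
current = db.execute("""
SELECT value FROM observations
WHERE state='GA' AND date='2020-07-01' AND metric='cases'
ORDER BY batch_id DESC LIMIT 1
""").fetchone()
print(current[0])  # 97 -- but the original 100 is still traceable in batch 1
```

The design choice mirrors a transaction log: the current table is just a view over history, so a state revision never destroys the record of what was previously published.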
And I'll leave off here with the image on the right, which is very important. This became the emblem of our database, with spreadsheets and data in the style of other COVID Tracking Project logos. It was a brilliant work of emoji art by Nikki Campbell, which, as a theme, will come back later in this talk. So, we've talked about the systems and the techniques that we used to enter the data. Now, what data did we capture and report? The first data CSV in our GitHub repository was for March 4. On that day we captured up to four metrics in 14 jurisdictions, for a total of 45 data points. On the last day that we captured data, we covered 56 jurisdictions, we captured up to 33 metrics, and there was a total of 784 data points. So you can see that we definitely expanded the scope of what we captured and reported. Early on, the reporting landscape was extremely varied, and there were many nuanced wrinkles in the data that seemed like they might be meaningful to capture and report. So, for example, we considered capturing the number of quarantined people, coding cruise ships as their own locations, tracking what type of exposure people had, and the types of labs that conducted testing. Those were some of the things that were very relevant to understanding the landscape at that point. None of these ended up being things that we captured and reported. Instead, we found that what was relevant was to provide a national summary, on a daily basis, of some key metrics: how many test results were reported, how many cases were found, how many people were currently hospitalized, and how many had died. All these numbers come with huge caveats, which we talked about often and will do some more later in this presentation, and they were reported very differently by different jurisdictions. So we did our best to cobble together a cohesive picture. And oftentimes this meant that we had to take a metric that we reported early on and expand it into multiple metrics. One such example is our positive field, which represents unique people with a case of COVID-19. At first glance, positive seems like a simple enough concept, and really a binary thing. The dictionary definition is a medical test that shows a person has the disease or condition for which they are being tested. However, on dashboards it wasn't always clear what kind of positive test was reported. And it turns out that different tests could result in a different kind of positive, and a different kind of case. The COVID-19 case definition that was adopted by the CDC is quite long. It's currently scrolling in this animated GIF on the left, and it defines a confirmed case, a probable case, and a suspect case, based on the types of tests, symptoms, and exposure that a person had. When this recommendation was released, and then when it was updated, it impacted the reporting of positive cases on state dashboards. And as these reporting recommendations rippled through the dashboards, we adjusted our capture to match. So we ended up with three different fields to reflect cases. We added two columns to provide a more nuanced view for the jurisdictions that separated their reporting, and our original field became the catch-all, holding either the sum of probable and confirmed cases, or just whatever that jurisdiction reported if it wasn't clear. And unfortunately, in a few cases, it still isn't clear what kind of cases a jurisdiction is reporting.
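In rough terms, the catch-all logic works like the sketch below. The published data set did carry separate confirmed and probable case fields alongside the combined positive field, but this function is illustrative, not CTP's actual ingestion code.

```python
# Illustrative only: roughly the catch-all logic described above. This
# is a sketch, not CTP's actual ingestion code.
from typing import Optional

def combined_positive(confirmed: Optional[int],
                      probable: Optional[int],
                      reported_total: Optional[int]) -> Optional[int]:
    """Prefer confirmed + probable when a state separates them;
    otherwise fall back to whatever total the state reported."""
    if confirmed is not None and probable is not None:
        return confirmed + probable
    return reported_total

print(combined_positive(900, 100, None))   # state separates them: 1000
print(combined_positive(None, None, 950))  # state reports one number: 950
```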
And figuring out what kinds of cases a state reported involved a lot of research, because oftentimes, when a state puts up a case number on their dashboard, it'll just say something like "cases," and you have to go deep into the footnotes, deep into the weeds, to understand what they're actually reporting. And so we spent a lot of time at the COVID Tracking Project really staring at dashboard footnotes, in size eight font, hidden at the very bottom of state dashboards, trying to understand what was going on and what these states were reporting, because every state was doing things differently. And what we started to learn down in these footnotes was not just information about what states were reporting, but also about the way that it was produced. For the most part, people in the COVID Tracking Project really aren't health informatics people; we weren't super familiar with the systems that produce public health surveillance data. And we realized from these footnotes that we just had a lot to learn about the way this data was produced, and that it was going to be a real uphill battle. We soon figured out the axiom, if you will, of public health reporting in the United States, which is that it's really a state-by-state thing. It's a very, very patchwork landscape, with really different state data pipelines, all in different states of health. So you have to build up that individual state understanding. And some of them are really falling out of date; they have been underfunded. So in one state, every test might be automatically transmitted to the health department via the most up-to-date electronic protocols, and in another state, you may have as many as 40% of case reports coming into the state health department via fax, and people just have to read through these paper reports in order to produce the data that we get on the dashboards. And what we started to realize when we began researching these questions of where the data comes from was that it really would shape what we saw on the other end, on a dashboard. We needed to pay attention to those limits of the data to understand how to use it responsibly, because of the way it shaped states' overall reporting. So here's an example of why that's the case, from Georgia, and it's a really common difference in sourcing for state numbers. This is the difference between the number of total tests that a state conducts, as opposed to confirmed cases in the state. And this one is especially important because these two numbers together form a metric that many people use to understand viral prevalence, called test positivity: you divide the number of cases by the number of tests. That way, if the state is not doing very much testing, you're not getting an artificially low case number; you're weighting it by the total amount of testing. So you'd think that total tests and confirmed cases would have the same sources, because total tests counts the number of PCR tests, and confirmed cases counts people with a positive PCR test. But it turns out that when tests are hard to track in a state, like when they're lost or sent via fax, these numbers actually tend to diverge in their sourcing. And that's because of what's urgent for the state health department.
So cases are really urgent for a health department to know about, because when you have a case, you need to do a lot as a health department. You need to get in touch with people who may have been close contacts, and you need to follow up and see how that person's disease is progressing, whether they're getting better. And so states are very thorough when they count the number of cases: they go through the stacks of faxes and count every single case and make sure it ends up in their systems. But when it comes to negative tests, there's not really anything the health department needs to do with that data. You know, you test negative, and that's good news; the health department doesn't need to follow up. And so many states will either leave those faxes in the corner and go through them all in one day, which results in a big data dump, or they'll restrict themselves to only counting the tests that are easy to count for negative tests, and that's electronic reporting, as opposed to fax reporting. So you'll see here in the total test number for Georgia, if you look at the footnotes, the only source for that number is electronic laboratory reporting. They're leaving out the faxes. And in fact, when you compare the number of positive tests, which you'd expect to be higher than confirmed cases, because it counts repeat tests, it's actually lower, because it's excluding all those fax reports. And so what you end up with is that if you divide the number of cases by the number of tests in Georgia to compute positivity, that's actually not a valid calculation, because it's top-heavy: it's including tests in the numerator that aren't in the denominator. You need to divide the positive tests number, which only includes the ELR, the electronic laboratory reporting, by the total tests number to have a responsible calculation. And we tried to provide that context to our data users and make sure that people understood that the way this data is shaped, the way that it's produced, affects how you can actually use it.
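To make that concrete with some hypothetical numbers (illustrative, not Georgia's actual figures):

```python
# Hypothetical numbers, just to make the sourcing problem concrete.
# Cases include fax reports; the tests total only includes electronic (ELR)
# reports, so cases/tests mixes sources and overstates positivity.
cases_all_sources = 1200        # confirmed cases, faxes included
positive_tests_elr = 1000       # positive tests, ELR only
total_tests_elr = 10_000        # total tests, ELR only

top_heavy = cases_all_sources / total_tests_elr  # numerator broader than denominator
matched = positive_tests_elr / total_tests_elr   # both sides ELR-only

print(f"top-heavy positivity: {top_heavy:.1%}")  # 12.0% -- inflated
print(f"matched positivity:   {matched:.1%}")    # 10.0% -- responsible calculation
```

The gap between the two results is entirely an artifact of mixed sourcing, which is why matching the numerator's sourcing to the denominator's mattered so much.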
So one thing that helped us over time with understanding the evolution of this kind of data context, and in general the evolution of the data over time, was having a store of state data screenshots. So I'll just say a few words about our screenshot system, which we built for a bunch of reasons. The most important was data provenance: it was really important to us to be able to show our work and where the data came from. This is very closely connected to data quality. We needed the ability to go back and check our numbers against what had been published on specific dates by the states themselves, so we could answer questions like, what were the testing counts that New Hampshire reported on March 7, and how did they show that data? Which gets us into data annotations. States would make changes to data definitions pretty often, and this kind of context had a good chance of being reflected only on the visual version of a website, as opposed to in a data CSV. Screenshots also helped us maintain history. This was particularly useful for situations where we needed to update our data in response to state data revisions. The database was one side of that, but having older screenshots helped us keep track of what the states themselves put out. And then, lastly, we did this for archival purposes. Eventually we want to be able to present a record of what was reported and when, and have that record be as complete as possible. So we built a system to essentially screen-grab state websites. In a nutshell, it was a Python script that used a cloud service called PhantomJsCloud to render websites and capture screenshots, and also to run various kinds of custom JavaScript to navigate around as needed. That behavior was encoded in a YAML config for every state. All of this lives on GitHub; we'll link you folks later. So this is an example of a config. It's structured to our use cases: there's a state URL, there are some web browser settings like page size, and then we have a message that we used to remind ourselves roughly what this config is doing. The interesting part here is this bit in the middle called the overseer script (a rough sketch of the overall shape follows below). This example is far from the most complicated thing we had to do, but really often we would need to do things like: click on the fifth tab of this dashboard, which of course doesn't have a reliable selector name, so instead it's click on the fifth HTML element of a certain type; wait 49.3 seconds for the page to load; click our heels; spin around three times; hope for the best. A lot of the time this worked. Sometimes it didn't, and in cases where it didn't, one of our amazing volunteers would take a screenshot manually and put it into our screenshot store. So along the way we ended up building an accidental brain trust on automatically navigating around ArcGIS and Power BI dashboards. Jonathan Gilmore, Hannah Hoffman, so many folks on the project put a ton of work into maintaining these screenshot configs over time. We actually reached 100% coverage on our main data points toward the end of the project, which was a big accountability milestone for us.
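Since the slide itself isn't reproduced in this transcript, here is a rough reconstruction of the shape of such a config, loaded in Python. Field names and values are illustrative, not the project's actual schema.

```python
# A rough reconstruction of the *shape* of a per-state screenshot config,
# based on the description above. Field names and values are illustrative,
# not the project's actual schema.
import yaml  # pip install pyyaml

config = yaml.safe_load("""
state: XX
url: https://example-state.gov/covid-dashboard   # hypothetical dashboard URL
message: Click through to the testing tab, then grab the full page.
renderSettings:
  viewport: {width: 1400, height: 2400}
overseerScript: |
  // Runs in the page via the rendering service, for example:
  // click the fifth tab (no reliable selector name, so go by position),
  // wait for the charts to load, then capture.
  document.querySelectorAll('.tab')[4].click();
  await new Promise(r => setTimeout(r, 49300));
""")

print(config["state"], config["url"])
print(config["overseerScript"].splitlines()[0])
```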
So as I mentioned before, we started as a series of spreadsheets, but found that we needed to provide more context and visualization for this data, especially as news organizations, media organizations, and government institutions were starting to present this data and visualize it, sometimes in ways that would be misleading. So we really wanted to set an example of how to put that data into context. We had a number of principles. First and foremost, everyone should have access to the story of the visualization, regardless of their ability or the device that they're using. Over 50% of our web traffic was on mobile devices, so we really wanted to ensure that regardless of what community you were a member of, in terms of access to devices, you could get the information you needed. We also wanted to provide consistent and direct charts that people could use as something to look at when comparing one day or trend to the next. We also created an entire visualization guide that talked about how to avoid misleading maps and fancy visualizations that might obscure what's really going on on the ground. And in all of our visualizations, we wanted to ensure that we called out anomalies. Kara mentioned that things like data dumps from faxes could really throw off a chart and make it look like maybe there's a huge problem in a state, when in fact that's a big chunk of data from another period of time. So, as part of making all of our content and our website accessible to people regardless of ability, every chart we had had keyboard navigation. So if you were a person who had sight but was not able to use a mouse, because of, for example, a motor impairment, you could toggle around all of our charts and maps using just a keyboard. And we also generated automated alt text for all of our automatically generated charts, as well as provided the raw data for every chart. Every day we produced the series of charts that you've now seen several times, and we replicated that chart in terms of design and color on our website. People really told us in our feedback forms that they held on to this chart; this chart was the one constant in their lives. I found that feedback really interesting: people were saying there's so much information and everything's very confusing, especially in the early days of the pandemic, and just having something that was consistent and clear was really helpful. You'll see on the lower right-hand side of this slide two examples of our annotations. So in new deaths here, we have these big spikes. We would always flag those and, when you clicked on them, provide more information about why that spike might not necessarily be accurate. We found early on (this is a screenshot of some of our state interfaces from May) that the presentation of data in a table layout sometimes created cognitive frameworks that would imply correlations between columns and rows that weren't necessarily accurate. So for example, in these three states, this column is actually in different units. So what we ended up doing, in one of our many redesigns of our website, was to instead put data into cards that we could add more context to, so every element had a definition. This is on our website today as well: if you click on one, a very specific definition of each field will pop up. And we also included per-data-set, per-state flags if there was something that you should really be aware of. You'll also see in here an example that says total test specimens, where other states might report in different units. As a result, our data page is about as long as a CVS receipt, but a lot of people gave us the feedback that having just one big page where you can look at every state was really helpful for them. That's all great, but I really want to talk about the great state of Delaware. Delaware was peopled for thousands of years; the native Lenni Lenape lived in Delaware and down the eastern seaboard. But then the Dutch came and set up a colony, whose name I won't attempt to pronounce, in 1624. The Swedes, who I didn't know lived there, had a little colony in 1654, and then they were kicked out by the English, who were like, we're going to give it to Maryland. But then Pennsylvania said, well, we want Delaware, and there was a lawsuit for over 100 years, until these two guys named Charles Mason and Jeremiah Dixon came to settle the issue. They mapped out the boundaries of Delaware, and this is where the Mason-Dixon line gets its name. The reason I bring that up is that Mason and Dixon were maybe great cartographers, but they were not very good user interface designers. So, as a result, this is a classic example of the kind of interface you would see on newspaper websites, or even maybe federal agencies: a map of the US that showed COVID data. But the problem was that these maps were sometimes the only way to navigate to the data; it was not only a map, it was also a means of navigation.
So little Delaware was really hard to get information for if you had motor skill issues and couldn't use a mouse with the fidelity needed to get down to little Delaware, or if you were on a mobile device, where your thumb (this is the actual size of the average American thumb) covers the entire eastern seaboard. So, as a result, when we did want to do state-level maps, we spent a lot of time on developing hex maps that presented every state as a consistently sized target. That also allowed us to make it more of a grid layout, which meant that if you were navigating with the keyboard, you could use the arrows to move up and down really easily. Another part of not only presenting but delivering our data to as many people as possible was that we had an API from day one, in both CSV and JSON formats. Our API's traffic was orders of magnitude larger than our website traffic; we were delivering upwards of nine terabytes of API data a month. And that was great, because it meant that small newspapers could present to their users an automatically updated daily chart of COVID data for their state. A lot of that was powered by the internal API and database that Julia mentioned, and we would then just host it as a static website, which helped our volunteers not have to focus on things like operations.
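For a sense of what consuming that API looked like, here is a minimal example pulling the national time series from the project's public v1 API, which continues to serve the archived data; see the API documentation on covidtracking.com for the authoritative endpoint list.

```python
# A minimal example of pulling the (now archived) data from the project's
# public v1 API. The endpoint path follows the project's published API;
# check covidtracking.com/data/api for the authoritative documentation.
import json
from urllib.request import urlopen

URL = "https://api.covidtracking.com/v1/us/daily.json"

with urlopen(URL) as resp:
    days = json.load(resp)

# Each record is one day of national numbers; print a few records
# with a couple of the headline metrics.
for day in days[:3]:
    print(day["date"], "cases:", day["positive"], "deaths:", day.get("death"))
```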
So, to conclude, I just wanted to share a little bit about the community that made all of this, making the sheets, collecting the data, producing the charts, building the website, possible. It's a pretty awesome community. Producing this data was a very large group of dedicated, wonderful humans, who, contrary to this photo, were not all named Bob. And when I say producing this data, I really do mean down in the data. This was a manual data set; like we've covered, every data point had two eyes on it. And this is what big corners of our Slack looked like: just full of threads for checking each state and discussing particular numbers in each state. So what this amounted to was a big community of people around the United States, and in fact the entire globe, that was recording this experience as we also lived through it. And that could be very intense. You know, for many people, I think the project was a form of collective mourning, a way to remember each person affected by the pandemic by making sure they were counted in the numbers. But despite the nature of the task and the nature of the data that we were dealing with, we didn't want CTP to be a grim place for people to log on to each day, and in the end, I don't think it was. Rather, we wanted people to find meaning and connections through this work, to support each other through the pandemic and also the task of counting. And so we took very seriously creating the virtual place, the Slack, where we gathered. This is a piece of art that our data science lead Dave Luo made. It's called Emojitropolis, and I think it speaks pretty well to that. Even though it's not quite obvious, it's actually a graph: a graph of the custom emoji that were uploaded into our Slack, our virtual home, bucketed by week, so the emoji that were uploaded each week create something that looks like a city skyline. I just wanted to tell the brief history of how we came to have so many emoji in our Slack. It's a little bit of a silly yarn, but I think it's a portal to understanding our community, and also the work that came out of that community. So, just for context, this is our logo; it's the circle with the little arc around it. Two really fundamental and consequential questions were first posited about this logo in May of 2020, and they ushered in the emoji ages we know now. The first was posed by our community lead, Amanda French, who had the good sense to wonder: why is there not yet a logo emoji? It's been two months with this project. And that was quickly remedied. And then the lead of the COVID Racial Data Tracker, Alice Goldfarb, asked the second pivotal question, which was: what if it's fun? What this did was spawn a realization that our logo is infinitely memeable. And the first riff I was able to track down in the Slack came a few days after the consequential CTP spin logo was uploaded. Someone wrote to our email to ask whether he might have permission to use our logo as his car company's insignia. And we said no, but we did try out the concept for ourselves, just to see what it looked like. And this kind of opened the floodgates. Before you knew it, you'd blink, and then you'd open your eyes and there would be 20 new CTP emoji in the Slack. And then you'd blink again, and there would be 1,000 CTP emoji in the Slack. And then, you know, you'd go and grab a cup of water and maybe take a walk around the block, and you'd come back and the official CTP emoji team, led by Nikki Hamburg, was organizing a March Madness emoji bracket to choose the favorite CTP emoji, the winner of which, by the way, was CTP This Is Fine, which I did draw, not to brag. And, you know, this says a lot about CTP: that it's a community of extremely creative people, just full of energy, who really enjoy building things together, whether that's a critical data infrastructure or an emoji infrastructure. And what you see with the emoji really is the way things felt on this project in general. You'd just blink, and suddenly someone had spun off like 50 sheets in a spreadsheet and a million other things. But I have one favorite emoji, or really ten favorite emoji, because it's the emoji with the most variations in our Slack other than the logo itself, and I think it really cuts to the heart of CTP as an organization. And that's the thank-you emoji. There are ten ways to say thank you on the CTP Slack, including a few that are specifically devoted to Kevin, which I think tells you about the many miracles that Kevin has worked on our project. CTP has a real culture of gratitude. There's even a gratitude channel that's just dedicated to thanking people. It's a place, I think, where people really wanted to show that they cared about each other, that they cared that people were putting in the work to fill this fundamental hole in the US pandemic infrastructure. And that care was at the heart of our work, I think; the care for each other was care for the data as well. And we wanted to do everything that we could to get it right. So, we had over 800 volunteers in data entry, data quality, outreach, reporting, infrastructure, communications, visualization, and the website team. Just in data entry, we calculated that over 20,000 hours of time were dedicated to the COVID Tracking Project. Many people volunteered full time; many people volunteered more than full time. They were all doing volunteer work that was really filling gaps that other institutions should have been filling, and I just want to commend them for that.
This has been a very hard, difficult, tragic year, and the people that you see here, along with hundreds of others who are not pictured on this slide, have sustained both me and each other. I'd like to quote from Erin Kissane, who's one of our co-founders. For many of us, assembling a daily count of the sick and the very sick and the dead was an act of service that kept us going through a difficult year. But it also came at a high cost, especially when we found ourselves working with numbers that include digits representing our own friends and family members. For many of us, doing the work together, in community with each other, made all the difference. I'm going to hit the next slide. So here are the resources again. We have a lot of links to blog posts, especially recent ones, on our website, where you can see a lot of analysis and updates that we've done in a retrospective format that I think are very powerful and interesting, and I would encourage you to check them out. And here is all of our contact information. And I think we'll now take questions. Thank you so much, Michal, Kara, Julia, and Kevin, for such a great talk, and thank you for being part of such an incredible effort. We're so glad we got a chance to hear about it. I do have a few questions from the audience, and I'll start with a question from Paula, which is: did you encounter issues with data licenses? Ones that were not open, for example. I wanted to say something to this. Not so much licenses per se, but we had a lot of instances where we would write to states and ask them for specific data, or time series data to smooth out data dumps. And a lot of times what would happen was that they would send it to us as a CSV or an Excel file. But on the testing and outcomes side of the project, we didn't want to use data that wasn't publicly available. So we did a lot of back and forth of just asking them to publish the data themselves on their website, so that it would be available to the public. And sometimes it worked and sometimes it didn't. Sometimes we even had to go back and scrape their Power BI for the same data, just so we knew that we were using publicly available data. I also know that Puerto Rico, I think, ended up being the only state that had an API. We did want to use that API, but there was a data issue with it; we were chatting with them and they tried to work on it, and as far as I know it never got resolved. We have another question, which is a little bit depressing, which is that sadly this isn't going to be our last pandemic; we're likely to run into something like this in the future. Does the COVID Tracking Project team have any plans to write some sort of playbook for future volunteers that might take on a similar effort, or for organizations like the ones that you've been dealing with, to be able to better provide more usable data? Because something that I've seen often is that people want to do the right thing, but they have no idea what the best practices are. So do you have any plans to do something along these lines? I would say we've done some of that work already, in terms of the posts that we've made to our website that have been much more descriptive in terms of how we work, rather than just what kind of analysis we're doing. We are working on, as we mentioned before, an official archive, to ensure that not only our data but our organizational structures and methods are preserved. I'm not sure if anyone else wants to speak to that.
I think, just a quick follow-up: in general, at this part of the project, even though our public data collection stopped on March 7, we're now in the phase where we are actually trying to think about that question and answer it as best we can. And we're very much still putting out blog posts on covidtracking.com, a lot of which try to get at the different parts of the system. And there are components to pretty much all of those posts that are a little bit of a how-to, or takeaways, or things that would, hopefully never but possibly, come in useful if something like this ever needs to be done again. Great. I have time for one more question. I'm not sure if this was already covered, but how often did you run into proprietary data formats that were not as simple as CSVs that you could just easily process? If no one else wants to talk about this, I can. I actually have a recent blog post that talks about all our automated data scraping, and there were PDFs, HTML, CSV, Excel, and also just the APIs that power different dashboards, as well as hacking around in dashboards that don't expose such things. It's all there, and the link is in our resources. How often? I would say all the time, and sometimes there just was no data available at all. Yeah. Well, thank you very much again for this great talk.