Well, hello everyone. I'm Dan Bernstein, a research analyst at the Legal Services Corporation, and I'll be speaking today about my office's work to collect and analyze very messy and very diverse data from across the U.S. civil justice system. I hope you come away from this talk with a better understanding of how civil court data can support more informed policymaking and help organizations promote equality in the justice system. My talk is focused on a few questions that are listed on the slide, and while I'll be covering many of the lessons learned from this past year of work, I will mention that this project is ongoing, and we're eager to engage with others in the data space or the legal space to inform our approach to data gathering, data sharing, and data privacy.

Very briefly, I work for the Legal Services Corporation, a nonprofit that was established by the U.S. Congress to fill the justice gap. The justice gap stems from a difference between criminal and civil court. In U.S. criminal court, if you cannot afford an attorney, you can have a public defender appointed to represent you. This protection does not exist in the U.S. civil justice system. If you are facing eviction or debt collection, or attempting to address domestic violence in your life, and you cannot afford an attorney, you are forced to represent yourself in court. Many studies have shown that your odds of winning a case are significantly better when you have an attorney. So this is a large issue for low-income individuals who face many civil legal issues but cannot afford legal representation. LSC, the Legal Services Corporation, receives almost half a billion dollars a year to fund civil legal aid. We provide this money as grants to local legal aid providers throughout the United States and the territories, which use the money to help low-income individuals with their legal issues for free. The grantees mostly represent people in court, but data about this larger civil justice system is not usually accessible to the people in the courtroom.

Within LSC, I work in the Office of Data Governance and Analysis, a new office created in 2015 to help the organization and our grantees around the country better use data in their operations and outreach. In the past year or so, our office has realized that there is a need to better understand how legal issues affecting low-income Americans actually play out in the court system, and that has led to a project I've spent the past year working on. Internally, we call it the Civil Court Data Project. Our goal is to use web scraping to gather court records from state and county courts across the United States, for a few reasons. First, we want to answer very basic questions about how low-income individuals experience the civil justice system. This includes questions like: how often are low-income individuals represented in eviction or debt collection cases, and how does legal representation correlate with outcomes? These questions seem very basic, but there has been little research that uses large amounts of data to describe the landscape of civil justice and poverty. We also want to build tools, such as dashboards, that help our grantees better align their work with local needs. If we can determine which parts of a state or city are experiencing a high volume of legal issues, our grantees can adjust their outreach to target high-need areas.
All of these questions revolve around helping legal aid providers, organizations that traditionally have not had access to large amounts of data or data analytics expertise, better serve their communities and communicate to funders that their work is important in changing the lives of low-income Americans.

To give you an example of how we can use court data: about two weeks ago, my office received a request for data about garnishment in the state of Tennessee. If you lose a court case and owe someone money, the winner can file to garnish your wages or bank account, which means they can take money directly out of your paycheck before it is deposited. We were asked by a group of advocacy organizations in Tennessee to give them any data showing that private debt collectors were still trying to collect outstanding debt, even as millions of Americans lost their jobs to the COVID-19 pandemic. The advocates wanted to convince the Tennessee Supreme Court to put in place an order that would forbid private debt collectors from garnishing the $1,200 stimulus payments that Americans received in April. I got to work assessing the availability of court data in Tennessee, and in less than 36 hours, we were able to gather and analyze five years' worth of court data, which included over 300,000 civil cases. We clearly showed that almost 2,500 people, just in the city of Memphis, had experienced some kind of court activity related to a debt collector trying to garnish their wages in the month of April. We were also able to highlight many cases where employers specifically identified the pandemic as the reason someone had lost their job, so the employer would not be able to withhold any wages from those individuals. The advocates delivered this analysis to the Tennessee Supreme Court, and it will be used as the court makes a determination about its next steps.

To give you a sense of what the data generally looks like, we can take a look at this example: a web page from Tarrant County, Texas, the county that includes the city of Fort Worth. This page comes from the public access web portal where you can look up any case that has been filed and entered into the system. In this example, we can see many elements that are useful from a data analysis perspective. We have a case number that uniquely identifies this case among all other cases in the system. We also have information about when the case was filed and what the case type is; here it's an eviction case from the year 2000. We also get party names and street addresses, which enable geospatial analyses, and the names of any attorneys associated with the two parties. In this case, neither party is represented, so the lead attorney for both parties is identified as pro se, which means on their own behalf. We also have information about who won the case and how much money the loser owes the winner. All these pieces of information could help us answer the questions we are after, but they're stuck on an HTML web page rather than in a clean, tabular format that's ready for analysis. My work is to get all of these case files and extract the information for analysis. But as we'll see, it's not as easy as writing one script for scraping and another for parsing. There are many barriers to gathering this data that we have had to overcome, and that we continue to address with every location we attempt to study.
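Before getting into those barriers, here is a minimal sketch of what that extraction step can look like. The HTML structure and CSS selectors below are hypothetical stand-ins, not the actual Tarrant County markup; in practice, every portal requires its own selectors, discovered by inspecting real pages.

```python
# Minimal parsing sketch. The HTML structure and selectors below are
# hypothetical stand-ins, not the actual Tarrant County markup; every
# portal needs its own selectors discovered by inspecting real pages.
from bs4 import BeautifulSoup

def parse_case_page(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    record = {}

    # Case pages often present metadata as label/value pairs in a table.
    for row in soup.select("table.case-info tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            label, value = cells
            record[label] = value  # keep the site's original field names

    # Party and attorney names usually live in a separate section.
    record["parties"] = [
        p.get_text(strip=True) for p in soup.select("div.party-name")
    ]
    return record

with open("case_2000_12345.html") as f:  # hypothetical scraped file
    print(parse_case_page(f.read()))
```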
First, not all states provide access to civil court records. This map shows the availability of court records at the state level. Only nine states make data available in a manner that we consider truly available. This means that a single website contains multiple years of historical case records, like we saw before, for all counties within the state. It also means that the state website has not imposed any limitations that prevent scraping, such as a CAPTCHA, a paywall, a login, or terms of service that explicitly prohibit scraping. The nine available states provide decent geographic diversity, but we're missing data from the West, the Southeast, and New England.

So we've also investigated the availability of data at the county level. There are many counties in the United States that have more residents than entire states. We went through a process of assessing the 100 largest metropolitan areas for the availability of court data, the same way we did with the states, and we also leveraged data on the prevalence of eviction and debt that others have published to prioritize where we use our resources. In the past year, we have conducted data collection in the locations on this map for the period of 2000 through the end of 2019. To date, this data includes over 25 million civil case records from states and counties that are home to over 40 million Americans. We're still missing geographies in the Pacific Northwest and New England, but we have filled in many gaps and will continue to pursue counties as our time allows. We're also currently building architecture to allow us to scrape and parse data in near real time, so that as the United States begins to come out of the pandemic, we can monitor the patterns in case filings and outcomes. We'll want to see how landlords respond to the reopening. Will courts be inundated with eviction cases, and how does that vary across states in relation to the policies and laws they have in place? The overall geographic distribution will be another key element that we'll be studying.

Another major issue is the structure of the data. There is no standard format for court data. Here we see five examples of five different formats. They all display similar information about court cases, but the HTML structure and the variable names are all different. In some areas, we can use the same scripts for scraping and parsing, because counties in the same geographic area seem to cluster and use the same website technology providers. But for the most part, we have to develop custom scripts for every jurisdiction that we approach. Having similar data in different formats also forces us to address the question of standardization. We can either preserve the format and naming conventions of the individual data sources, which makes our analyses more complex and time consuming, or make assumptions and impose a data model on the datasets, which would make cross-site analyses easier but lose the nuance of the individual sources. We have chosen not to standardize data up to this point. Our intention is to extract the data in these web pages using the original naming conventions and structure, so that an analyst has an easier time understanding how the clean data relates to the raw HTML data. But as we pursue analyses across jurisdictions, we will have to make some assumptions and do some standardization for specific analyses.
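One lightweight way to do that per-analysis standardization is a per-jurisdiction crosswalk that renames only the fields a given cross-site analysis needs, while the raw data keeps its original naming on disk. This is an illustrative sketch; the jurisdiction keys and field names here are invented, not drawn from the actual sources.

```python
# Illustrative standardization sketch: per-jurisdiction crosswalks map
# raw field names (invented here, not the real ones) onto a shared schema
# only for the fields a specific cross-site analysis needs. The raw data
# keeps its original naming conventions on disk.
CROSSWALKS = {
    "tarrant_tx": {"Case Number": "case_id", "Date Filed": "filed_date",
                   "Case Type": "case_type"},
    "shelby_tn":  {"docket_no": "case_id", "filing_dt": "filed_date",
                   "category": "case_type"},
}

def standardize(record: dict, jurisdiction: str) -> dict:
    """Rename just the fields the analysis needs; drop everything else."""
    mapping = CROSSWALKS[jurisdiction]
    return {std: record[raw] for raw, std in mapping.items() if raw in record}

raw = {"docket_no": "CT-2019-00042", "filing_dt": "2019-03-08", "category": "eviction"}
print(standardize(raw, "shelby_tn"))
# {'case_id': 'CT-2019-00042', 'filed_date': '2019-03-08', 'case_type': 'eviction'}
```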
To further describe our workload, here's a basic diagram of how I imagined my workflow would be when we started this project. I thought that we would evaluate the data on a given website; if the site was suitable, we would scrape all the case numbers, then look up each case number to get the associated case record, and we would then have all the case records. Then we would simply parse the HTML, clean the parsed data, and produce beautiful tabular data and documentation ready for analysis.

But as you might guess, it didn't work out that way. We found issues at nearly every step in the process that forced us to go back and iterate on scripts and strategies. When working with data on the web, you might have issues connecting to a server or issues retrieving data. The server might send back errors if you query the website too quickly, or it might just be inaccessible during certain hours of the day due to maintenance. Sometimes we would get all the way to clean data before realizing that a substantial portion of it contained errors we could have detected if we had built in explicit verification steps earlier in the process.

After overcoming all the issues in web scraping, there are further issues in extracting data. Some web pages might look like they contain a table, but due to errors in the underlying HTML, there might actually be missing HTML tags that ruin what would otherwise be a pretty predictable extraction process. And after parsing, we've had to make assumptions about the data in type conversions and feature engineering. For example, there might be an element on the page called "money owed," which you assume is a numeric value. But after cleaning the data, you find all kinds of wonky values that you would never expect under that variable, which forces you to go back and iterate on how you clean information from that section. When dealing with thousands or millions of items like we are, we needed to really understand the diversity of potential values by studying dozens of HTML files from each site before we were able to clean the data.

As I mentioned, we have recently begun developing the cloud architecture needed to do the scraping and parsing on a regular basis. When you're doing this data collection just once for historical data, there is room to iterate and repeat steps when they fail. But in an automated workflow, we really need to fully understand the ways a website operates and represents data, so that we don't have to manually fix code and data frequently. In developing these new workflows, I've reflected on the lessons I have learned from developing dozens of scrapers and parsers over the past year. The lessons really break down into three areas: be respectful, be resilient, and be resourceful.

First, we need to understand that scrapers do impose a burden on the websites they access. If you query a website too quickly, you could have your IP address blocked or take the server down momentarily. It's important to monitor how long it takes the server to respond to your requests and to add delays in your code to ensure you are not overburdening the website's servers. The Urban Institute has developed a simple tool called SiteMonitor, available on GitHub, that monitors your requests to a server and dynamically adjusts the time between your requests so that you don't overburden servers. I strongly recommend this tool to anyone developing scrapers that will access websites for thousands or millions of queries.
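To illustrate the idea, here is a minimal sketch of that dynamic-delay pattern. This is not the SiteMonitor code itself, just an illustration of the concept, and the portal URL and case numbers are placeholders.

```python
# Minimal sketch of polite, dynamically throttled scraping. This is not
# the Urban Institute's SiteMonitor code, just an illustration of the
# pattern: watch how long the server takes to respond and back off when
# it slows down. The URL and case numbers are placeholders.
import time
import requests

def polite_get(url: str, min_delay: float = 1.0) -> requests.Response:
    start = time.monotonic()
    resp = requests.get(url, timeout=30)
    elapsed = time.monotonic() - start

    # Wait at least min_delay, and longer if the server is responding
    # slowly, so the scraper never outpaces what the site can serve.
    time.sleep(max(min_delay, 2 * elapsed))
    return resp

for case_number in ["2019-001", "2019-002"]:  # placeholder case numbers
    page = polite_get(f"https://example-court-portal.gov/case/{case_number}")
    print(case_number, page.status_code)
```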
The next lesson is to be resilient. Many of the errors in our workflow came from not fully understanding the websites we were scraping. There are predictable errors, such as incorrect inputs, and unpredictable ones, such as server errors, downtime, and IP blocking. You want to handle as many of the known issues up front as you can, and also build mechanisms to learn about unknown issues as they arise. I spend a good amount of time up front just manually investigating a website to understand what query patterns will trigger errors and what HTML, CSS, and JavaScript patterns occur during those errors, so I can build mechanisms to overcome them. We have also built verification steps into our workflow. After scraping all of the case records and parsing them, we explicitly compare the number of raw records to the number of parsed records and thoroughly investigate any discrepancies, and we do the same at every stage of the process. (I'll show a small sketch of this check in a moment.)

Finally, be resourceful. When you're automating web queries, each decision you make has ripple effects. If you include a code snippet that takes twice as long as an alternative, that doubling propagates through every iteration, which can drastically lengthen the time it takes to scrape and parse court records. With web scraping, you often have to choose between using simple requests to a server or using a slower, headless-browser approach that actually renders every web page it encounters as you navigate through a website. If you thoroughly investigate the network activity required to access web data, you may find that you don't actually need the slower, headless browser, and that a series of HTTP requests will suffice. This will greatly speed up your workflow, and most browsers include great network analysis tools that make this process very easy to understand and mimic.

We've also learned lessons about creating large datasets. When I have used public datasets in the past, I never really thought about how difficult the decisions must have been for the creators: taking disparate datasets, putting them into one format, making all those assumptions, and documenting everything. The first lesson is to be clear about the intentions of your dataset. We continue to debate the relative merits of preserving the structure of raw data versus standardizing across different websites. The ultimate decision really comes down to how people will interact with the information. If you're building a small number of products that are specific to individual websites, you could build custom tools that use the structure of the raw data. And if you need to provide interpretability, preserving the raw naming conventions will definitely help analysts compare the raw datasets to what you ultimately produce.

We've also learned a lot about building data products for users. We often need to make decisions about what data is suitable for different use cases. If the data will be used for operational decision-making, such as monitoring the civil legal fallout after the COVID-19 pandemic, then having updated and accurate information readily available is important, and the structure of your architecture and the data model you produce will follow.
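Here is that verification check as a minimal sketch. The directory layout and file naming are hypothetical; the point is simply to compare counts explicitly at every stage instead of assuming success.

```python
# Sketch of the verification step between scraping and parsing. The
# directory layout and file naming are hypothetical; the point is to
# explicitly compare record counts at every stage instead of assuming
# each stage succeeded.
from pathlib import Path

raw_ids = {p.stem for p in Path("raw_html").glob("*.html")}
parsed_ids = {p.stem for p in Path("parsed_json").glob("*.json")}

missing = raw_ids - parsed_ids
if missing:
    # Investigate before moving on: these cases were scraped but never
    # produced a parsed record.
    raise RuntimeError(f"{len(missing)} records failed to parse, e.g. "
                       f"{sorted(missing)[:10]}")
print(f"Verified: {len(parsed_ids)} of {len(raw_ids)} records parsed.")
```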
So I want to thank you all for your attention, and I want to thank csv,conf for putting on a brilliant conference during these difficult times. And as I mentioned, we are very open to learning more about how this data could be used in other contexts, so please get in touch with me if you're interested in collaborating. I can be reached by email or on Twitter. Thank you all.

Yeah, thank you. It's really great. I was just Slacking with the other organizers that I get teary-eyed when I see these kinds of projects grabbing all this civic information off of these websites and then normalizing it and thinking it through. It's just amazing how much power is out there that is not being leveraged, so the work that all these different projects, including yours, are doing to unlock it is amazing. We are running right on time, so I'm going to say that questions can be moved over to Slack. Thank you very much for the presentation.