My name is Diane Goldenberg-Hart. I'm the Assistant Director of the Coalition for Networked Information, CNI, and you are joining this webinar on On the Books: Jim Crow and Algorithms of Resistance as part of CNI's Spring 2020 virtual meeting. We're really delighted and grateful to you for making time out of your day to join us for this really interesting topic. Our presenters today are Lorin Bruckner, Amanda Henley and Kimber Thomas, all of UNC Chapel Hill; Kimber is a CLIR postdoctoral fellow. We are delighted to welcome our speakers and our attendees here today. Before I hand it over to our panelists, I just want to draw your attention very quickly to the Q&A box. There's a little button at the bottom of your screen that says Q&A. If you click on that button, a window will pop up, and you can type your questions into that window at any time; we will field those questions live after our presenters have given their presentation. I also want to draw your attention to the chat box. Again, click on the chat button and a window will pop up. We'll be sharing some links and additional information in that chat box, and you are also welcome to chat with us or with everyone attending through that chat box at any time. If questions come through the chat box, I will share those with our panelists for addressing later in the talk. With that, and without any further ado, I want to welcome again our speakers and hand it over to Kimber, who will get our talk started today. Welcome, Kimber.

Thank you, Diane. Following the defeat of the Confederacy during the Civil War, many African Americans found themselves experiencing a little taste of freedom. Many voted despite violence, some held office in many parts of the South, and civil rights laws remained on the books.
But this would soon change, as many white Southerners found ways, through a codified system of racial apartheid called Jim Crow, to hold on to their "civilization" and deny African Americans their rights and freedoms. The Civil Rights Act of 1875 prohibited racial discrimination in public facilities such as transportation and guaranteed African Americans equal access to public places such as restaurants, theaters and hotels. The act recognized the equality of all men before the law and noted that it was the duty of the government, in its dealings with the people, to mete out equal and exact justice to all, of whatever nativity, race, color or persuasion. African Americans had some protection under the law, as those denied the full enjoyment of any of the accommodations, advantages, facilities or privileges outlined in the Civil Rights Act of 1875 could expect that their rights would be upheld in a court of law. But eight years later, in 1883, the US Supreme Court ruled in the Civil Rights Cases that this act was unconstitutional: because restaurants and hotels, et cetera, were private establishments, their owners could make those decisions at their own discretion. The Civil Rights Cases spurred the emergence of Jim Crow laws in the United States, and less than 20 years after the end of slavery, African Americans found their pathways to freedom blocked by new laws meant to segregate on a racial basis. This project uncovers some of the many laws that were put in place to maintain racial segregation in the US between 1865 and 1968. The system of Jim Crow meant that white authority and supremacy was grounded in black difference and segregation, which meant that buses and bus stations, bathrooms, cafes, churches, civic clubs, cotton gins, movie theaters, schools and more were all separated by race.
The laws uncovered here demonstrate how, for African Americans especially, Jim Crow affected nearly every aspect of daily life, imposing a tax on their freedom, respect and dignity. Next slide, please. Laws that segregated cemeteries, funeral homes and hospitals add depth and texture to the history of race in the American South and demonstrate how the denial of African American rights and freedoms was not only enacted through everyday social interactions but was made legal by a series of racial statutes called Jim Crow. It is understood that these Jim Crow laws are pervasive in the North Carolina general statutes; however, a comprehensive listing of all of the Jim Crow laws does not exist. Our project, On the Books: Jim Crow and Algorithms of Resistance, was motivated by a reference question from a K through 12 teacher in North Carolina who was looking for a comprehensive listing of Jim Crow laws. One of our librarians from Special Collections realized that no comprehensive source existed and worked with others to determine the feasibility of using text analysis to identify the laws. We proposed the idea to the Collections as Data: Part to Whole project and were funded as part of cohort one. We are creating a text corpus of over 100 years of North Carolina public, private and local session laws and resolutions, and we're using text analysis to identify discoverable North Carolina segregation statutes enacted during the Jim Crow era. The collection we're using for this corpus was digitized as part of another project, done between 2009 and 2011 under an IMLS grant called Ensuring Democracy through Digital Access. That project was a partnership between East Carolina University, the State Library of North Carolina and the University Libraries at the University of North Carolina at Chapel Hill. The best existing source of this information is States' Laws on Race and Color, a book described by Thurgood Marshall as the Bible of the Civil Rights Movement.
This work was compiled by the Rev. Dr. Pauli Murray, who you see here on this slide, and who was a lawyer, priest, human rights activist and co-founder of the National Organization for Women. The book catalogued racist laws in every state of the country, including Murray's home state of North Carolina. Our project is building on the work of Dr. Murray. When she was compiling her book, the methods that we're using did not exist; she identified laws by going through print volumes. Richard Paschal, a North Carolina legal scholar, has done similar work to identify North Carolina Jim Crow laws. We've built a training set to locate Jim Crow laws programmatically using the laws identified by Murray and Paschal and laws classified by scholars on our team, William Sturkey and Kimber.

I reviewed hundreds of laws provided by the project team in order to classify these laws as either Jim Crow laws or not. This work helped the project team to build a model to detect similar laws throughout the corpus. My process for identifying the laws was fairly straightforward. Laws were separated into two categories based upon my classification of them: yes or no. Laws that I classified as yeses, or Jim Crow laws, presented evidence of legalizing or enforcing racial segregation. Many of these laws outlined the ways in which the white and colored races should be separated, either spatially, via the allocation of physical resources, or by other means, and often included words or phrases such as "shall not," "shall never be compelled" and "separate." By contrast, laws identified as no's, or non-Jim Crow laws, presented no evidence of legalizing or enforcing racial segregation. Next slide, please. My classification of these laws included very little analysis.
That is, while there were laws in the corpus that, when interpreted, could have had implications that would have supported Jim Crow practices on the ground, I focused specifically on identifying as positively Jim Crow those laws that explicitly legalized or enforced racial segregation. This slide and the next show examples of the types of laws that I encountered: one is not a Jim Crow law, and the one on the next slide is a yes. The last law had to be cropped, as it was pulled from a four-page section outlining all of the ordinances that the Board of Aldermen had the power to enact in a particular town.

So in addition to the scholarly perspective that Kimber and William contributed, this project has required a wide range of additional skills. These include a deep knowledge of the collection that we're using to compile the corpus, knowledge of legal information, project management and administration, and a range of technical skills, including programming (our technical leads are using Python), version control, optical character recognition and text analysis. As you can imagine, these skills are not housed in any one department, or even one library for that matter. One of the ways that we're working across units is through a special collections digital scholarship working group that was convened in 2018 by Maria Estorino, who is our AUL for special collections and director of Wilson Library. Maria's goal in convening this group was to encourage collaboration between staff members working in special collections and the functional specialists in the digital research services department. There's been a lot of enthusiasm and interest in this project, and as a result, as you can see from this slide, our project team is very large. Team members include functional specialists from the library's digital research services department, librarians from special collections, and scholars.
We're fortunate to be at an institution with staff resources that have allowed us to bring in specialists as needed. We've relied on others for expertise in metadata, legal information and software development. I'll talk a little bit about our workflow and processes in preparing the corpus. Our first task was to compile a comprehensive listing of volumes for our corpus, which starts with 1865-66, goes through 1967 and contains 96 volumes. We used a custom Python script to download the JP2 images for these volumes from the Internet Archive using the Internet Archive's API. We then prepared the images for OCR. These images show pages from two different volumes from two different years and illustrate some of the challenges that we worked through. You'll notice several things here. There's marginalia on the outside of the page. The marginalia is really just a finding aid for legal scholars; it's not part of the laws, and for our purposes it really just needed to be removed. You'll see that the marginalia is not in the same location on each page, and that it comes quite close to the actual text of the laws. You'll notice also how the pages differ from volume to volume: different contrast, text size, font, et cetera. And you'll also see that the image on the right side of the slide here is skewed. These images show the changes that we made after the pre-processing. The images were cropped to remove headings and marginalia, and the images were rotated as needed. Lastly, blank, color-balanced margins were added back to the images to improve the OCR performance; OCR does not perform well if the words go right up to the edge of the page. All of this was done with custom Python scripts that use Pillow. We used these corrected images to OCR over 80,000 pages with Tesseract. We've also devised a metadata schema for the XML output, which we nailed down with some assistance from our metadata librarian.
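The preprocessing steps described above, cropping away headings and marginalia, deskewing, and adding blank margins back, can be sketched roughly as follows with Pillow. This is a minimal illustration, not the project's actual script: the crop box, rotation angle and margin width are hypothetical placeholders, since in the real workflow those values varied from volume to volume.

```python
from PIL import Image

def preprocess_page(img, crop_box, angle, margin=100):
    """Crop away headings/marginalia, deskew, and re-add blank margins.

    crop_box, angle and margin are illustrative placeholders; the real
    values were tuned per volume, since layout and skew differ by year.
    """
    img = img.convert("L")                               # grayscale
    img = img.crop(crop_box)                             # drop marginalia
    img = img.rotate(angle, expand=True, fillcolor=255)  # deskew, white fill
    # Paste onto a larger white canvas so text never touches the page edge,
    # which, as noted above, hurts OCR performance.
    canvas = Image.new("L", (img.width + 2 * margin, img.height + 2 * margin), 255)
    canvas.paste(img, (margin, margin))
    return canvas
```

The cleaned image could then be handed to Tesseract, for instance via the pytesseract wrapper, to produce the text for the corpus.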
The elements that we plan to include are volume title, date range, law type, chapter and section. In addition to creating the corpus of all of these statutes, we also have the ambitious goal of analyzing the text of the corpus to identify Jim Crow laws. The volumes are organized into chapters and sections, and our unit of analysis is the individual law, which corresponds to a section. So in order to be able to do this text analysis, we must parse out the chapters and sections, or laws. There are a lot of challenges with doing this, and in fact parsing out these laws has taken most of our time. There are a lot of errors in the original volumes, such as missing chapters or chapters that were listed twice. The chapter labels are located in the margins for some volumes, and of course we removed the margins. The chapters for some volumes are shown as Roman numerals, which did not OCR well at all. And there's also fairly low accuracy on the OCR of numbers, so you have confusion between threes and eights and fives and twos, for example. We're using regular expressions to fix common errors and student labor to split volumes that we simply could not split programmatically. This is probably going to be the biggest limitation in our corpus. But right now we already have over 200,000 sections split, and as we continue to improve the splitting process, the number of laws in our corpus will continue to grow.

So I'm going to talk about the text analysis part of this project and some of the machine learning techniques that we're using. One way to develop a model that can predict which laws are Jim Crow laws involves training that model to recognize them. So currently we've been preparing a training set, and this is a really long process because it has to be done manually. We need people to actually read a random sample of laws and tell us, this is a Jim Crow law, or this is not a Jim Crow law.
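The regular-expression cleanup and section splitting described above might look something like the sketch below. The substitution rules and heading pattern here are invented for illustration; the project's real rules were tuned to the OCR errors actually observed in its volumes.

```python
import re

# Illustrative cleanup rules for common OCR errors; the project's real
# rules were tuned to the errors actually observed in its volumes.
FIXES = [
    (re.compile(r"CHAPTEE"), "CHAPTER"),       # R/E letter confusion
    (re.compile(r"(?m)^S[e3]c[,.]"), "Sec."),  # garbled "Sec." headings
]

# A section heading such as "Sec. 1." or "SECTION 12." starts each law.
SECTION_RE = re.compile(r"(?m)^(?:SEC(?:TION)?|Sec)\.?\s+\d+\.")

def split_sections(text):
    """Apply cleanup substitutions, then split a chapter into its laws."""
    for pattern, repl in FIXES:
        text = pattern.sub(repl, text)
    starts = [m.start() for m in SECTION_RE.finditer(text)]
    # Each law runs from its heading to the start of the next heading.
    return [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]
```

Digit confusions like a three read as an eight generally cannot be fixed by blind substitution, because the surrounding context determines the right digit; that is one reason some volumes still had to be split by hand.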
So some of those Jim Crow laws have already been identified for us by Pauli Murray and Richard Paschal, and we're also working with the African American studies scholars William Sturkey and Kimber Thomas. That's a good thing, because if you think about what kinds of laws are going to be in these books, a great majority of them are going to be about things like whether oysters can be harvested from a certain stream or whether two railroad companies are allowed to merge with each other: a lot of material that isn't related to Jim Crow. So far our training set has only 100 laws that we've identified as Jim Crow and 800 identified as not Jim Crow. What that means, as you might expect, is that our model right now is much better at telling if something isn't a Jim Crow law than it is at telling if something is a Jim Crow law. So we are focused on better developing that training set right now. Next slide. The models that we've been experimenting with so far include a naive Bayes model and a gradient boosting model. The naive Bayes model performs better; it has a 90% accuracy score, although that score is mostly the result of accurately classifying the laws that are not Jim Crow laws. So again, we're working on our training set so that the model can do better at identifying the laws that are Jim Crow laws. Next. Another way to approach our problem involves using unsupervised machine learning, where we don't use a training set; instead we let the machine decide how it's going to group the laws itself. The type of unsupervised learning we're currently playing around with is topic modeling. This is a tool that basically looks for similarities and differences in the law texts and then divides the laws into groups based on that. Next slide. This is a screenshot showing the results of our very first unsupervised classification. Now, this topic model is only based on a 5% sample of our laws. It produced some topics that seem to make sense.
Like up here you see civil and jury and criminal and chief all put together. But none of these topics really looks like Jim Crow laws. So our next steps here are to do this sort of modeling again, but on our entire corpus, and we'll be able to do that once we overcome some of the parsing and annotating challenges that Amanda spoke about earlier. We'll also be experimenting with the number of topics we use in the model and seeing what we end up with. At this point I want to point out that training a model to accurately predict Jim Crow laws is not the ultimate goal of this project. It's just one element of the project that we hope to learn from. We don't know if it will be possible to create a model that can accurately find all Jim Crow laws simply based on their language. But we intend to teach people about our process and some of its limitations regardless of what we ultimately end up achieving as the outcome. So our main deliverables, then, are outreach and education. We've presented our project to multiple audiences, including librarians, digital humanists and K through 12 teachers, and we hope to continue presenting on it. We'll be providing a white paper on this project. We'll also be publishing a website, and the website is going to make our corpus, all of this text that we've been putting together for these laws, available for anyone to download and for other researchers to work with. We plan to assess that website after three years to determine if maintaining it is still necessary; if not, the text will still be retained through our Carolina Digital Repository. And we're also providing a repository on GitHub, not just for code, but also for a number of Jupyter notebooks that explain our code. In this example, here's an excerpt from one of the Jupyter notebooks that we've put together on how we remove the marginalia.
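In the same spirit as those notebooks, here is a minimal, hypothetical sketch of the supervised naive Bayes approach described earlier. The example laws and labels are invented stand-ins for the real 900-law training set, and the TF-IDF features are an assumption; the project's actual feature choices may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-ins for the hand-labeled training set
# (1 = Jim Crow law, 0 = not a Jim Crow law).
laws = [
    "white and colored children shall not attend the same school",
    "separate accommodations shall be provided for the white and colored races",
    "no person shall take oysters from the waters of said county out of season",
    "the two railroad companies are hereby authorized to merge",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a multinomial naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(laws, labels)
prediction = model.predict(["the races shall be kept separate on all railroad cars"])
```

Note that with the 100-versus-800 class imbalance the presenters describe, a headline accuracy score is misleading; recall on the Jim Crow class, or rebalancing the training set, matters far more, which is why growing the set of positive examples is the current focus.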
So these notebooks not only include documentation on the different functions that we've created, but also on how we use those functions in Python. They basically provide tutorials on how to do what we're doing, with some text and images that help explain our code and how it works. As for our next steps, Carolina K-12 is going to create a curriculum using our deliverables. We're seeking some additional funding to further our work on improving the law splitting. We want to create some search functionality on the website so that people can do keyword searches to find specific laws. We also want to improve our machine learning models and expand our analysis: not just figure out whether we can classify the Jim Crow laws, but also do some exploratory analysis, maybe seeing if there are trends over time in this text data, or geographic trends across different counties in North Carolina, things like that. We'd like to create some open educational resources for use in academic classrooms. And we want to investigate the potential of using these methods on laws from other states. Should we take this to South Carolina? Is there maybe an opportunity to look at this on a national level? So before I wrap up, I just want to address a question some of you might have about why we're doing this kind of work at the library. I think a lot of people might expect research like this to occur within a school for computer science or something like that. Next slide. But we have a new data science initiative that we're very excited about here at Chapel Hill. The dean of our School of Information and Library Science is the chairperson of the steering committee for this initiative, so we very much see libraries and librarians as having a major role to play in this. And the initiative is going to include a brand new school for data science.
So we as librarians will be tasked with supporting data-focused research and education, and we need to have that experience and expertise ourselves. Also, collections at our libraries are becoming data, and data are becoming collections. As librarians, we need to continue to interact with those collections in some way, and projects like this really make that possible. Thank you very much. We are looking forward to hearing your questions.

Thank you, Lorin. Thank you, Amanda and Kimber. That was a fascinating talk, really so interesting. The first thing I just want to share with you is a thought that I had, which someone from the audience also mentioned right off the bat: providing Jupyter notebooks for this work is awesome, truly a fascinating way of using Jupyter notebooks, really interesting. So with that, let me just start off, because we already have some questions in the queue. That comment about the Jupyter notebooks was from Jeff Oliver, who also has a question: what is the time estimate, in person hours, for creating the 900-law training set in the supervised learning approach?

I would say that Kimber is probably the best person to speak to how much time she's spent reading laws. It has been high. What do you say, Kimber?

So I didn't do 900 at one time; we sort of broke it up. I would do 200 at a time, and then 100, and then 20, but for 200 laws it would usually take about two weeks. So it took maybe two months to identify 900.

And the other part of that would be pulling the laws that Paschal had identified and that Pauli Murray had identified. She had identified some that were in the code as well as in the statutes, and we're just looking at the statutes, so it took a bit of time to go through and pull the ones that are in our data set. I'm going to say probably a few days' worth of work there as well. Okay, thank you.
Thank you, Jeff, for that question, and thank you, panelists, for the answer. Just to follow up, Jeff comments: thank you for the answer regarding efforts on the training set. I ask about the efforts because this is an often overlooked part of the labor behind a machine learning approach. So, something for everyone to keep in mind.

I can say that without Kimber's expertise and William's expertise, we would not have been able to do this work. We really needed to have their judgments on these laws. The rest of the project team are not historians or scholars in African American studies, so we absolutely had to have them.

Okay, thank you. Moving on to our next question, which comes from Carmelita Pickett. Carmelita asks: can you discuss how the laws will be organized for this project? For example, laws that were specific to labor and wages. Last year, for instance, Governor Northam repealed the law that allowed employers to pay less than minimum wage for specific jobs typically held by African Americans. Any thoughts on that?

Well, Lorin, did you want to speak to that?

Yeah. In our most idealized version of this project, once we have the laws parsed, separated and made available in some form, perhaps a database for downloading online, we would tag them all, assigning them specific subjects or topics or even places, so that you have a high level of search functionality for people who are interested in looking at that kind of thing. That, however, is going to be quite a lot of work; it is in itself a project. So I'm not sure if it's within the scope of what we're doing now. There are a lot of ideas in that arena, though, I think.

One of the things that we've been considering is producing multiple corpora, because at first, I think, we thought we would find all of the laws and then go through and contextualize all of them and list them on our website.
But there are a lot of them, and probably the best way for us to move forward, at least for phase one of this project, is to provide multiple corpora: one corpus of the entire range of statutes, and then one of the laws that are highly likely to be Jim Crow. Then we could take the Jim Crow corpus, do some topic modeling on it, and do some additional investigation, if we're able to find additional funding. So, like Lorin was saying, it's going to be quite a tremendous effort.

And a lot of data for other researchers to dig into as well. So I'm sure this will spur quite a lot of interesting work. We have lots of folks thanking you for this wonderful presentation and looking forward to working with the Jupyter notebooks. A minor technical question: are you using plain text files or XML files with Python?

So right now, with Python, we're working with plain text files. They're a specific type of tab-delimited text file called TSV that is often used with OCR texts. In terms of the XML, one of our goals is to provide these files annotated with the metadata, and that will be in XML format. We are not yet working with those in Python.

Okay, thank you. And thank you for that question. Another question here, building on the previous question: can you discuss briefly the unit of your analysis? The title refers to laws, but the tagging seems to be page and section of the legal volumes.

Right. So the section is the law; the actual individual law, the smallest unit, is the section. And they're arranged by chapters. So you'll have your law type, which would be private laws or public laws or public local laws; within that you have your chapters, and then you have your sections. The section is the actual individual law.

Okay, that's helpful. Thank you. And let's see: can you talk a little bit more about the parameters of your topic model, if that makes sense?
Yeah, I'm afraid at the moment I don't have the details on what parameters were used in the screenshot that we shared in the slides, so unfortunately I'm not able to expand on that.

Okay. Okay, thank you. Lots of great questions, clearly lots of interest in this topic, and a tremendous amount of appreciation for your presentation and the work that you've done. We still have a little bit of time for questions. Just a quick thanks for the clarification in response to my question: great talk, thanks. So please feel free to type your questions into the Q&A box. Also, we have the ability to unmute attendees in this tool, so if you have a comment you'd like to make, or if you would like to ask your question live, you can raise your hand and I can unmute you, and we can turn on your microphone so you can ask that question live. While we're waiting to see if we have any more questions, I just want to remind everyone that this is part of CNI's Spring 2020 virtual meeting, which will continue on through the end of May. We have lots more webinars planned for you, and I hope you'll check out the meeting website at cni.org and take a look at all the webinars yet to come over the next few weeks. Seeing that we don't have any more questions coming in right now, I'd like to once again thank our panelists so much for joining us, and thank you so much to our attendees. Just to let everyone know, I will be closing down the public part of this presentation and turning off the recording, but if you'd like to hang around, maybe sort of approach the podium and have an informal chat with our presenters, please stick around and raise your hand. That will signal to me that you want to have an informal chat, and I will turn on your microphone after I stop the recording. So thanks very much, everyone.