research in Black culture. We are just so happy to have you here. This webinar is going to be about computational archival science. We want to say thank you to Dr. Richard and Dr. Jane and the team, who are going to present their work. I just want to let you know that if you have questions, please put them in the Q&A section, and feel free to introduce yourself in the chat and tell us a little more about you. Without further ado, I want to say thank you to the Mellon Foundation, which made possible this webinar series on emerging technologies, big data, and archives. I also want to thank CLIR, the Council on Library and Information Resources; the Schomburg Center, New York Public Library; and also Oklahoma State University, which co-sponsored this series on emerging technologies, big data, and the archives. So now it's over to you, Dr. Richard, and your team.

I'm going to share my screen. Can you hear me well enough? Yes. Okay, welcome everyone. I'm absolutely thrilled to be part of this webinar. Big thanks to CLIR and also to the organizers of the AERI conference. We're actually double-streaming on this channel: this is one of the virtual AERI week sessions as well, so we should have really good attendance, and there's a really relevant focus theme of archives and technology. Let me move forward here. These are my collaborators today, so this is going to be a collaborative group session, and that's actually a reflection of the nature of the field we're going to introduce, talk about, and share with you. All of these projects involve learning by doing, working in interdisciplinary teams with clients and customers, and very diverse teams with computer scientists, archivists, librarians, and information scientists. It's really interesting and exciting work, and I hope to give you a sense of that.

Here's the agenda. I'm going to give you a little bit of context on what this thing we call computational archival science is, where it came from, and what its relevance is in the moment; that's the here and now. I'm going to talk about the launch of a collaboratory that is meant to facilitate and disseminate collaborations in this space. But beyond the context, my primary focus, hopefully in the first 10-15 minutes, is to show you a just-in-time case study that students, practitioners, and I have been working on for maybe two and a half weeks now, since mid-June. The idea behind this is to show things in progress, so that these presentations don't look like well-finished, polished products; this is really in-flight work. The topic is going to be artificial intelligence and deep learning using FDR World War II presidential library diaries. My colleague Jane Greenberg is then going to transition and talk about very significant library and data science archival education initiatives, in particular her IMLS-funded LEADS program, which might be of interest to some of you online.
The core of the presentation is really going to be the fellows, both the LEADS fellows and the AIC fellows, who will spend probably two-thirds of the presentation walking you through an in-depth case study on computational treatments to remember the legacy of slavery, also labeled "reasserting erased memory," a very timely project that exercises a lot of these techniques. Then we'll all come back and wrap up, and hopefully there will be some time for a good Q&A and discussion.

So let me get started. What is this thing called computational archival science? This is actually a conversation that started four years ago, and it came on the heels of asking what it means to start computing things on the archives side. We looked in particular at more established, more mature fields: computational X. Computational social science is well established; computational biology has been out there for a long time; there are programs, degrees, and postdocs in those fields. A relative newcomer is computational journalism, just a few years old, looking at news feeds; it intersects a little with archival science. So this is the focus of our seminar today: computational archival science. It's a little newer, but it's rapidly gaining momentum.

So what is CAS? It's an attempt to explore computational treatments of archival and cultural content. For those of you who might be interested after this webinar, there's a Google group, and I've put up a Twitter feed for the new collaboratory I'll talk about in a few minutes. Here's a working definition of CAS: a transdisciplinary (beyond interdisciplinary, very collaborative) field focused on the application of computational methods in support of archival processes, in particular appraisal, arrangement, description, preservation, and access. An additional dimension to this definition is the notion of scale: we're interested in computing things on very large collections, where human processes and workflows might no longer be adequate.
There's a foundational paper that came out two years ago; it was actually written four years ago, when we started, and took a little while to come out. If any of you are interested, this is how we set this in motion. It gives examples of interdisciplinary efforts through eight case studies, looking at computational linguistics, digital humanities, graph analytics, archival representation, things we call computational finding aids, digital curation, and so on. There's some interesting content there, in particular a focus on takeaway lessons and what the significance of this could be in an educational setting. Again, as part of my preamble giving you context: that was four years ago, and this is how it's manifesting itself right now. If you want to dig into this a little more, there are two issues of the Records Management Journal, published by Emerald, coming out this summer. I was guest editor with Julie McLeod, and they look at the kinds of computational topics we'll mention in the next few slides: explainable AI, natural language processing, automation of appraisal, computational archival science, distributed ledgers, ethics in the computational age, and so on. Keep your eye out for that. I also want to mention a special issue of the ACM Journal on Computing and Cultural Heritage; I'm co-editor with my colleagues Mark Hedges from King's College London and Eirini Goudarouli from the UK National Archives. The deadline is at the end of the summer, so if this presentation inspires you and you're thinking of contributing or submitting paper ideas, please check it out.

All right, almost there: the here and now. What is the purpose of all of this, and what is the sense of urgency? As we've been discussing for a long time, there are fundamental changes afoot in the way we acquire, manage, and present cultural collections. This is nothing new, but the pace has accelerated, and if you throw in COVID-19, we have a major challenge with the sudden lack of access to archival materials, to libraries, to museums. This puts an even greater sense of urgency on digital access and on preparing and computing things to provide new modes of interaction. Just a few representative and important points; we could fill several screens with interesting initiatives on the horizon. Here in the US, many of you have probably followed the OMB and National Archives directive that states that in the next year and a half, only transfers of electronic records with appropriate metadata will be accepted. This is a pretty fundamental policy change and is likely to have a ripple effect on the entire records management landscape. If you just look around in the US, the Smithsonian launched a data science lab a few years ago, and they're doing some really interesting things: in the middle of the museums you have artificial intelligence, deep learning, computer scientists, and biologists embedded in cultural institutions, which represents a significant shift. In the UK, our colleagues at the National Archives have launched some really impressive digital research programs. So the landscape is really bustling. Almost as an intervention at our end, in January and February we launched our own virtual collaboratory (I'll tell you who the collaborators are), which is called the Advanced Information Collaboratory, the AIC. It was launched at a meeting at the Alan Turing Institute in London.
These are the main goals: looking at the challenges of disruptive technologies for archives and records management, which include but are not limited to digital curation, machine learning, AI, and so on; pursuing multidisciplinary collaborations, of which you will see several examples today; leveraging untested technologies to unlock and discover hidden information in large stores; a training and educational component, which my colleague Jane Greenberg will touch on; and then a very important component that has to be embedded in all of this, the ethical dimension of access and use in the age of big records and big data. Here are some of the founding partners: former colleagues from the National Archives, Mark Conrad and Michael Kurtz; current colleagues at the University of Maryland, Greg Jansen and Bill Underwood; and then distinguished colleagues from the UK and Canada, Eirini Goudarouli, head of digital research programs at the UK National Archives, Mark Hedges at King's College London, and Victoria Lemieux at the University of British Columbia. Last but not least, Dr. Lyneise Williams, who heads a really interesting and timely collaborative called the VERA collaborative, which, as part of this intervention, looks at responsible archival practices around the visual and material culture of communities of color. There's a really strong focus on examining these environments for possibilities and opportunities and on the detection of erasures, including racial erasure and representational erasure, in the context of computational treatments. We've also launched an international network of partners, many of whom are online. I don't want to read the whole list, but I really should; if we have a little more time we can come back to it:
Dr. Anne Gilliland at UCLA, Bruce Ambacher, Sarah Buchanan, Marisol Ramos, and (I hope I'm not forgetting anyone) Karen Gracy, among others, along with colleagues in Europe from major cultural institutions and academic settings.

All right, almost there. If you want to know more about this space, go to the CAS portal; I've put the link up here, and it's part of our AI Collaboratory.net virtual organization. There's a solid and interesting body of work that goes back four years now: we've had some 27 workshops since 2016, including four major CAS workshops at the IEEE Big Data conference. If you go to that space you'll see over 50 papers and slide presentations, publications, infrastructure, and lots of projects. I'm listing a few here in orange; these are some of the projects we might touch on today. If you go to the second CAS conference, you'll see these kinds of mappings (this is why it's really archives-specific) of archival concepts to computational methods. A few examples: classification of archival images can be done with AI; for personally identifiable information, PII, we have studies and papers that look at NLP and named-entity recognition; decentralized recordkeeping with blockchain technologies; matching of records in distributed databases with graph and probabilistic databases; and so on. Lots of context, if you want to dig into this. Here are a few examples of projects: a UK-US two-day datathon at the TNA last June, and, in the fall, a very interesting project with our cultural partners at Densho.org, using computational thinking to unlock the Japanese American World War II camp experience, with some really good projects and interesting developments.

Okay, I'm going to end with an accelerated walkthrough of a particular case study; the students will then walk you through a much more in-depth analysis of the project we started working on last fall. This is the Morgenthau Holocaust Collections project. Henry Morgenthau was FDR's Secretary of the Treasury for 12 years, from '33 to '45. The library has over 860 bound volumes of all of his correspondence, business, speeches, interactions, memos, and so on, and there's also an additional series with over 30,000 index cards into the diaries. The card on the right has a label of "War Refugee Board" and references sections and documents in bound volume 696. This is the student team; I'd like to acknowledge Renee Geary, a practicing librarian, and Teddy Rambi, a computer scientist, along with fabulous colleagues at the FDR Presidential Library. I'm going to show you two quick things: how do you work on these kinds of projects, and what do you do? There was a digital curation portion of this project, and we just started looking at AI and deep learning on these kinds of documents; the preliminary results are really interesting and have a lot of potential. So let's start with the index to the diaries; this is one index card. We actually go to the website, parse the webpage, and pull out all the diary PDF files by crawling them; there are 65 of them. We explode them into individual JPEG images, which gives you some 30,000, and then we run traditional optical character recognition on each image header. Why are we able to do that? Because we know exactly where that information is going to be: it's in the upper-left corner, it's spatially segregated, so the traditional methods still work. Then we collect all the resulting text headers.
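To make that step concrete, here is a minimal Python sketch of the crop-then-OCR idea just described. The filename, DPI, and crop box are illustrative assumptions, not the project's actual values; a real pipeline would tune the header region to the scanned card layout.

```python
# Sketch of header extraction: convert a diary PDF to page images, crop the
# spatially segregated upper-left header region, and OCR just that region.
from pdf2image import convert_from_path  # requires the poppler utilities
import pytesseract                       # requires the tesseract-ocr binary

pages = convert_from_path("diary_index_cards.pdf", dpi=300)  # hypothetical file

headers = []
for page_num, page in enumerate(pages):
    width, height = page.size
    # Illustrative crop box: left half, top eighth of the page, where the
    # header label (e.g., "War Refugee Board") is assumed to sit.
    header_region = page.crop((0, 0, width // 2, height // 8))
    text = pytesseract.image_to_string(header_region).strip()
    headers.append((page_num, text))

print(headers[:5])
```

Because the header location is fixed and isolated, plain OCR on that small crop is enough; no machine learning is needed at this stage.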
like you take the card you OCR it you extract the header we apply that to all 30 000 cards and we actually create extract an initial set of almost 6400 headers they're not unique but this is something we'll use to validate the artificial intelligence in later steps the first series i'm almost there and then we'll hand it over to student we'll hand it over to Jane and the students the the first series is uh is even more interesting it's the diaries themselves if you go to the webpage here you have kind of a workflow you parse the webpage you extract by crawling all the PDFs which sometimes are multi-part you extract the table of content from the first PDF when it's a multi-part you explode that PDF into individual images and then you're able to populate this table with metadata that's been extracted from this process we then use that table in the metadata to create a content tree so what is a content tree well for the for the first book we would use this information and populate a set of hierarchical folders with the table of content exploded pages the main table of content and all the data files themselves which are the the scanned PDFs of the books themselves we then run that on all 881 actually uh bound books and create a large tree with over 6000 files and 25 gig of data um and then we're ready to go so i'll finish on a on a couple of slides there with me this is the AI machine learning part um so we are now ready to go to the table of content files there are 3579 of these images that we are going to operate on we essentially build a model to detect sections and these images this is how it's done we train the model we essentially annotate a number of pages we show the algorithm where and what the sections we're trying to pull out look like and that's done um if i go back um one second here that's actually done with a google cloud machine learning machine we then use an image library to crop all those recognized psychophacial recognition the recognized boxes or faces so so we've exploded these 3500 pages into tens of thousands of small sections and those that's our pool for the machine learning we run that through a second it's a two-step machine learning AI process we run that through another deep learning tool this is what it looks like and we train it on these little windows this time there's not just one label but there's a bunch of labels five we say this is this is like facial recognition we are this is pre-ocr we're saying this is we want to recognize faces in the document if you like this is what a nose looks like this is what eyes look like and where they might be located these are ears etc so we train it again and the deep learning then goes and identifies all sections and these carved out documents that look like those things so we we train it to recognize headers content dates pages book numbers and what comes out of the deep learning process is a an ontology a tagged set of attribute value pairs and we we're going to collect that information for every single one of these things so i'm going to end there but the the future is certainly here it's also now and to quote my colleague Mark Conrad there is an imperative looking at all these developments for educating the archivist in records management of the future for the current and next digital world this is my my segue to to Jane Greenberg who is going to delve into the educational implications of these new and changing landscapes so applause here thank you i should share screen now yes okay for its first first slide 
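As a rough illustration of the cropping step between the two models Richard described, here is a sketch that cuts detected section boxes out of a page image. The normalized-coordinate tuple format is an assumption based on what object-detection services such as Google Cloud's typically return; the function and sample values are hypothetical.

```python
# Crop detected sections from a table-of-contents page image so each crop can
# be fed to the second-stage model (headers, content, dates, pages, book numbers).
from PIL import Image

def crop_detections(image_path, detections):
    """`detections` is a list of (label, x_min, y_min, x_max, y_max) tuples
    with coordinates normalized to 0-1, as object-detection APIs commonly
    return them. Returns (label, cropped_image) pairs."""
    page = Image.open(image_path)
    width, height = page.size
    crops = []
    for label, x0, y0, x1, y1 in detections:
        box = (int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height))
        crops.append((label, page.crop(box)))
    return crops

# Hypothetical usage: one detected "section" box covering a strip near the top.
sections = crop_detections("toc_page_0001.jpg", [("section", 0.05, 0.02, 0.95, 0.12)])
```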
I should share my screen now. Yes, okay, the first slide. Everyone's good? Not hearing any complaints. Jane, your notes are showing. Oh, I don't have any notes; I don't know how to not display them. Is there something I should do? Oh, is that better? Okay, I think I'm going to go on. Is the sound okay?

Hi everyone, I'm Jane Greenberg. I'm a faculty member at the College of Computing and Informatics, Drexel University. Just to start: thank you. I'm really excited to be here, and thank you to CLIR and to Rebecca and Azure for really pulling this together. It's a very exciting time. I direct the Metadata Research Center, and I don't know if you saw me, but every time Richard said the word "metadata" I was going like this. That's a big draw for me, and the way I've gotten into this computational space, well, it's a cliché, but "your data is only as good as your metadata" is something I say and think is really important.

What I'm going to do is focus on the educational aspects and the importance of training and education, and I'm going to talk specifically about a project called LEADS, which stands for LIS Education and Data Science. It was developed through support from IMLS, so hats off to IMLS. And just to draw a little from the Metadata Research Center and explain the link to data: the center, which I developed with many collaborators (you can look it up and find out more), was founded with a focus on metadata, semantics, and ontologies and on formalizing solutions, and that's integral to the kinds of work going on in the data science space, particularly with this initiative of computational archives. On the first page I have a slide from an event we had with the Library Carpentry folks and LEADS fellows this January. Several LEADS fellows (we've had six) actually work in the computational archival space through collaborations, specifically with the AIC Collaboratory.

So let me move on and tell you a little about LEADS. The program was initiated with an emphasis on doctoral students and preparing future educators to bring data science techniques into education: preparing the next generation. It engages doctoral students in an immersive learning experience with a partner; at the time, this was called the National Digital Platform, though the terminology is now moving to "national information infrastructure." We had partners with large corpuses of digital data in the library, archives, and museum space, and we prepared fellows with skills, but also to be able to integrate this into their everyday work, to be confident in developing curricula as they become future educators, and to build a cohort. We have had 21 fellows, a set of 2018 fellows and 2019 fellows, and we're now hoping to move the program forward in a way that would include not only doctoral students but early-career folks, people on the front line; we are working on that. Just a little more about the program: on the left are our partners. I kept saying National Digital Platform; these are the partners that had collections considered big data and digital data for the fellows to work with, and each of these sites had at least one to two, and some of them three, mentors. For instance, at the DPLA, the Digital Public Library of America, the students who worked there had three mentors working with them.
OCLC had two; at the AIC Collaboratory, Richard and Greg Jansen served as mentors. On the right are some of the tools our students learned to work with. The way the program operated: students had to apply, and they were evaluated by mentors, ranked, and so forth. We really looked for people who wanted to get into this space, based on their essays, not necessarily people who already had data science skills; we wanted people who wanted to learn. We had more applicants than we could fund, which is exciting, meaning there's interest in this. It was tough, but we were able to fund 21 over the two years with the funding we had. The students who were LEADS fellows then completed an online data science curriculum, about 15 hours of work, and then came to Drexel and participated in a three-day intensive boot camp. By being together, we helped form a cohort. Some of the skills and techniques they learned came through lab exercises where they really rolled up their sleeves and worked on things: data mining, NLP, visualization, plus work on curation and data management. Students were at different levels, so some went deeper into Python and some into R. Students were required to know about their projects before they came to the boot camp, so we could help them identify which tools would be good, and then they had a 10-week immersive experience over the summer, working with their mentors to, as I say, roll up their sleeves and do some work to move a project forward, with great outputs.

On this slide, the six fellows who have dovetailed with or worked in the computational archival space specifically are listed. At the top are three of the fellows, Jamillah, Chris, and Adam, each of whom was placed at the AIC Collaboratory. And when I say placed, it was virtual; this was all virtual. The only thing in person was the data-intensive boot camp, where they came to Drexel, and as we're facing things now, we know that we could do even that virtually. Jamillah worked with the Densho collection, which Richard has mentioned. Chris and Adam were 2018 fellows, and they worked together on another collection the AIC Collaboratory has, on the redlining of cities; both did great work. At the bottom are Sonia (Sonia Pascua, from Drexel), Hanlon, from UNC Chapel Hill, and Ming, in the middle, from Missouri. Sonia and Hanlon are actually on the call and can share their experience. They had other placements in their LEADS fellowships, but they also went and participated in the datathon at the AIC in Maryland.

Moving on, to be mindful of time: this is a small slide of LEADS by the numbers and some of the outputs the fellows have come up with. Actually, the paper count needs to be updated; there are now four papers from students' experiences. I've listed a few of the kinds of outputs the students had, in terms of participating in Code4Lib, at their own institutions, and so forth. Sonia even went to the Philippines to share about LEADS, and gave a paper at the Dublin Core conference. I think this is really my last slide.
You can see the LEADS fellows at the very bottom, and other people who have been involved in the AIC or computational archives, working through Richard's center and this larger effort that the Metadata Research Center is now very excited to be connected with. I think this is a perfect segue to turn it over to Lori. Actually, just one final thing: I am so impressed at all the locations people are coming from; I'd like to make a data map of it. All right, Lori, take it away; I'm going to stop sharing. Thanks.

Wonderful, thank you, Jane, and welcome everyone, while I find my screen to share. My name is Lori Perine. I'm a doctoral student and a lecturer at the University of Maryland's iSchool. My pronouns are she/her/hers, and I'm coming to you from the traditional hunting grounds of the Piscataway people, the original people here in what is now known as Maryland. I'm very pleased to be speaking with you today. I'm going to present, along with some of my colleagues, a specific project we did using information from the Maryland State Archives related to slavery in the state of Maryland. What you see here to begin the presentation are pictures of the teams that worked together on three different data collections. There is a very large set of collections related to slavery that the State Archives has digitized and made available to the public; our teams worked on three specific collections: runaway slave ads... Lori, I think we missed something; we have something going on with your mic, it suddenly cut out. Can you hear me now? Yes, that's better, thank you. Okay, thank you; it seems there may have been a switch in the microphone. Thank you, Sonia.

So, manumissions. Manumissions and certificates of freedom actually have a connection to one another. Manumissions are the legal documentation that an enslaved person has been granted their freedom. In the context of the state of Maryland, that is usually a document that has been filed legally somewhere, in a courthouse or with the legislature. I know for sure this is not something that was done throughout the US colonies at the time, and eventually the states. And then there were documents that needed to be carried by freed people, whether they were born free or formerly enslaved and manumitted, called certificates of freedom. So there's a natural connection: someone who had a manumission record related to them would also potentially have had a certificate of freedom. We were looking at those two collections in particular, and in looking at those collections, what we found immediately was that we needed some sense of what I'm calling a data biography. You can think of that as the contextualization of the data we're looking at, in a very broad way. If we were going to apply computational thinking and computational methods, we really needed to understand the extent to which metadata was explicitly there, and what we implicitly needed to take into account as we looked at the information. That meant we needed to know the when, what, who, why, how, and where, not just of the digital resources but of the original documents as well, and that informed what we eventually used in terms of computational methods. We needed to apply this to all phases of our workflow.
This slide just gives you a top-level view of what I was saying: things like the provenance of the original documents eventually impact the datafication of those documents, and our actual choice of methods and tools when we apply something computationally. You can see how all of these attributes ultimately influence our computational position.

Several key elements of the data biography concern the historical context itself; there's quite a bit of information on this slide. The two things I want you to note are that slavery in the state of Maryland was initiated in 1642 and eventually abolished in 1864, so there's quite a wide date range in which we may potentially find information in our archives; that becomes relevant a little later in the presentation. Specifically with respect to the manumissions collection, we have to be aware of several factors. One was the legal context in which manumissions were made. It turns out that within the state there was a certain regulation of manumissions beginning around 1692, about 50 years after enslaved people were first brought into the state, and then at the end of the 18th century, in the late 1790s, there was state legislation that regulated how manumissions were to occur. So there are two different legal structures governing how manumissions may or may not have been recorded. Not only that: unlike all those beautiful files you saw in the FDR collection, the original source documents for manumissions varied wildly. There were enslaved people who purchased their own manumission, their own freedom, which was recorded in a certain way. There were manumissions that occurred via wills and in probate; manumissions that were essentially property transactions; manumissions that occurred because enslaved people joined military service in various wars, the Revolutionary War, the War of 1812, the Civil War; and manumissions that occurred due to movements meant to "recolonize" Black people in the Americas and send them back to Africa. All of those took different forms, none of them standardized.

By contrast, the certificates of freedom, we found, had a little more standardization; not the lovely standardization we would find in recordkeeping today, but more structure when compared to the manumissions. Certificates of freedom were first issued in 1806, about a decade after the state legislation was put in place to regulate them, so very early in the life of the state of Maryland, but they were clear legal documents. There were certain elements common to them because they were meant to be identification papers: in the early 19th century, people obviously weren't carrying around ID cards or driver's licenses or passports, so they used these documents, which described them, to identify themselves and to show that they had been legally certified as freed people in the state. So when those original documents were datafied, put into digital form, all of those elements were at play. And not only that: these were datafied over the past decade or so with certain types of technology.
Those technologies were primarily two-dimensional relational databases, using scanners, without really being able to use OCR much; so you're relying on visual recognition by volunteers who transcribe the information, with that information then being put into relational data sets, cleaned up, and stored using the technology of that time. What you would have then is something like this. This comes from the runaway slave ads, which are a little cleaner in terms of being able to interpret and transmit the information: you've got print ads rather than manuscript, which is what you have in the manumissions documents and the certificates of freedom. Here at least you're able to pull out information about the slave owners and the enslaved persons, in this case fugitive enslaved persons. What you'll notice is that there's quite a bit of other information contained in this document that may or may not have been datafied in some way, so we're beginning to lose a lot of context, and a lot of the explicit metadata that is in the text itself may not have been transferred into the data set.

So this is what we began with. We took that datafied set of information and began to enrich our understanding of the data biography, to understand what we were working with, so that we could then go along this workflow for our computational explorations: sourcing the data; cleaning and wrangling it; filtering and extracting information that would be relevant, or that our technologies would be able to treat; exploring it; and visualizing it. What you see, then, is an enhanced data flow from our original source documents into the computational exploration. We're going to talk about each of those steps in turn, starting with data sourcing.

For the basic sourcing: if you go to the website of the Maryland State Archives and click on "databases," it will let you look in any of the collections I've mentioned, and this is the page you'll see. What we did was gain access to that data by doing some web scraping using Python. It's a SQL database where things are stored; we scraped it and extracted the primary fields. There are also fields related to records management; some of those were retained, some were not, because in wanting to characterize the data we weren't really looking for those management fields in this particular project. That information was translated into a couple of comma-separated data tables, so that almost any other technology would be able to work with it. What we see at this point, as we source the data from those original documents, is that there has already been quite a bit of loss in translation, literally, but also in terms of the technical translation. Several of the considerations for us were, once again, looking at that data biography: understanding what metadata was or was not present; that we were dealing with relational, two-dimensional databases, if you will, and the limits of that; that we had non-standard primary sources; that there were clearly errors we would need to account for; and that there were assumptions made in defining the data fields that we did not necessarily know, and that perhaps the archivists didn't necessarily know either.
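For a sense of what that sourcing step can look like, here is a minimal scraping sketch. The URL, table layout, and output filename are placeholders, not the Maryland State Archives' actual site structure.

```python
# Sketch: scrape an HTML results table into a CSV that downstream tools can read.
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/msa/certificates"  # placeholder, not the real MSA endpoint

resp = requests.get(BASE_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

with open("certificates_of_freedom.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Writing out plain CSV mirrors the team's stated goal: a format almost any other technology, from OpenRefine to R to Tableau, can consume.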
There were also assumptions made in pulling manuscript information out of documents and translating it that we weren't necessarily aware of. One of the things we continuously had to keep in mind was: how well were the meaning of those documents, and the relationships originally embedded in them, captured by the digitized representations we had, and how well could we preserve some of that in our own representations? What elements were lost or ignored, why, and who decides what elements are retained? Let me ask my colleague Philip to talk about some of the specific challenges we had with data sourcing.

Thanks, Lori. Hi everyone, my name is Philip, and I worked on the certificates of freedom dataset with Rajesh. While we were exploring the dataset, we noticed some unusual information captured in it, so we had to check the original documents, and we found some transcription errors. In the sample currently being shown, the certificate of freedom for Joseph Codwell, we noticed that the original clerk documented Codwell's county of origin as Talbot County, but the transcriber actually recorded Baltimore County in the Excel spreadsheet. On to the next slide: in this certificate of freedom for Jeremiah Brown, we noticed that the original clerk did not capture the day, as can be seen from the highlighted part, so the transcriber captured a date in an incorrect format, with only a year and a month. To be consistent with the other date records, we had to transform this date by adding an arbitrary day, the first of the month, in this case giving the first of June 1840. On the next slide: while exploring the certificates of freedom dataset in Excel, we noticed that the transcriber listed the heights, in feet and inches, of the formerly enslaved, now freed, people in one combined column instead of separate columns. We also noticed that the transcribers listed Abraham as being 9.75 feet tall, which is unusually tall. And on the next slide, in this certificate of freedom for Susanna, the transcriber listed her as being 100 years old in the Excel spreadsheet, but the original record lists her age ambiguously as "80 and 20" years old, so we're not really sure: Susanna could potentially be 20 years old, or 80. That concludes some of the transcription errors we found, so I'll turn it back over to Lori.

Thank you, Philip. So, as we said, there are a lot of decisions being made about what is captured and what is not, and as I pointed out before, even when information may be easier to interpret because it comes in print form, there is information that may or may not be captured. Here is an individual who is integral to this particular event or transaction and who is not necessarily captured in the database, as well as some other information about place and time. Once we have the data, we take into account all of these possible omissions and errors, cleaning them and then filtering out what's important for us to begin to characterize the data. We can think of that as data wrangling. We used various resources, primarily OpenRefine, which is open-source data-cleaning software, and R, which is also open source but takes a little bit more training to use.
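Errors like the 9.75-foot height are exactly the kind of thing simple range checks can surface automatically before deeper analysis. Here is a hedged sketch of such checks; the column names and thresholds are hypothetical, not the actual spreadsheet's.

```python
# Sketch: flag implausible values so a human can check them against the originals.
import pandas as pd

df = pd.read_csv("certificates_of_freedom.csv")  # hypothetical columns below

# Heights far outside human range (e.g., a 9.75-foot entry).
suspect_height = df[df["height_feet"] > 7.5]

# Ages that are negative or implausibly high.
suspect_age = df[(df["age"] < 0) | (df["age"] > 110)]

# Dates that fail to parse, or fall outside a loose plausibility window
# for this collection (certificates were first issued in 1806).
parsed = pd.to_datetime(df["date_issued"], errors="coerce")
suspect_date = df[parsed.isna() | (parsed.dt.year < 1790) | (parsed.dt.year > 1870)]

print(len(suspect_height), len(suspect_age), len(suspect_date))
```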
Rajesh, could you please talk to us about cleaning in the certificates of freedom collection? Sure, thanks, Lori. My name is Rajesh Nana Sabron, and as Philip pointed out, we both worked on the certificates of freedom collection. While cleaning this dataset, we found issues with the formatting of a few fields, possibly as a result of incorrect transcription. For example, the field "date issued," the date of issue for the certificate of freedom, was captured and available as a string field, and due to a limitation of Excel, which does not properly format dates prior to 1900, we had to use a custom calculated field in a visualization tool, Tableau, to convert the string field into a proper date format, so that we could visualize the observations as a time-series trend, which will be explained later on. For the field "age," shown in the box to the right, which indicates the age of the enslaved person at the time of document creation, we found that for ages less than a year, the transcriber had captured the months as a decimal of a year, which is incorrect, so we had to properly convert them into correct fractions of a year, as you can see in this box.

On the next slide, we see a data-cleansing effort related to the field "prior status," the status of the enslaved person at the time of documentation. This field mostly had transcription issues, with different types of entries for the same term, like "born free" versus "free born," so we had to classify them into four generic types. For the highlighted category, "descendant of a white female," we had to do some research to find out which of the four categories those observations should go into, and we found it should be classified as free born, since children born of a white female were considered to be free.

On the next slide, we see how we handled the complexion feature: the complexion of the enslaved person, as captured by the clerk while documenting the certificate of freedom. A few challenges in handling this feature are listed here. Clerks documented complexion from their own interpretation, because it is subjective; they recorded it in colloquial words, like "mulatto" or "copper," which were not consistent between county clerks; there were spelling errors; there were inconsistencies in identifying complexion between counties, with one county recording something one way and another a different way; and some of the documentation was elaborate, so that instead of a word or two, there were too many words to fit in that field. In order to classify them, we performed some clustering using the OpenRefine tool and classified these complexions into seven broad categories, from bright to black, with light brown, medium brown, and dark brown among the other categories. That's all we have for cleaning from the certificates of freedom; now back to Lori to discuss cleaning of the manumissions collection.
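OpenRefine's text-facet clustering is commonly built on "key collision" methods such as the fingerprint keyer, and a small Python sketch of that idea follows. The sample values are made up; note that true spelling variants (as opposed to case, punctuation, or word-order variants) would need n-gram or phonetic keys, which OpenRefine also offers.

```python
# Sketch of OpenRefine-style fingerprint clustering for messy complexion terms.
import re
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Lowercase, strip accents and punctuation, then sort unique tokens, so
    variants differing only in case/punctuation/word order collide."""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    value = re.sub(r"[^\w\s]", " ", value.lower()).strip()
    return " ".join(sorted(set(value.split())))

# Hypothetical raw entries, as different clerks/transcribers might have typed them.
raw_entries = ["Dark Brown", "dark-brown", "brown, dark", "BRIGHT", "bright ", "Copper"]

clusters = defaultdict(list)
for entry in raw_entries:
    clusters[fingerprint(entry)].append(entry)

for key, variants in clusters.items():
    print(f"{key!r}: {variants}")
```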
Thank you, Rajesh. With the manumissions documents, we worked primarily in R to clean the data and structure it in a way that let us begin to make some sense of it. As we noted before, there were quite a few non-standard entries: we weren't sure we would always get the same information in different records, and there are quite a few records, thousands and thousands in both collections. We had to replace lots of null values, and just as with the certificates of freedom, we had all kinds of interesting transcription errors. For instance, our maximum age turned out to be 237 years, which we figured was probably incorrect, so we modified that; in this case, I think we marked it as simply not represented, or not applicable.

The thing that was really challenging for us is that with these manumissions documents, we found multiple date fields. I don't know if this is large enough for you to see, but there are approximately four different types of dates recorded in this collection; sometimes there were null values for those dates, and sometimes they were filled in. We were very puzzled by this; we couldn't really understand what was going on. What we eventually learned, by looking at the original sources and talking to the archivists and the historians, is that a manumission may have been granted or registered on one date but may not have been valid until another date. That is something called a deferred manumission. We had to make some decisions to arrive at a single year or date that would allow us to do comparisons with other collections, so we programmed an algorithm to clean up all of the date fields and also to find what appeared to be the most viable year for us to use, so that we could go on to visualize the data. The other very important thing here was that there were tons and tons of unstructured information in the notes, which we did not pull out in this round but which we would like to look at as we continue our explorations. Finally, as we sought to look at connections between our manumissions collection and our certificates of freedom collection, what we discovered almost immediately was that, if we were just doing basic data merging or matching, there were fields that had the same names but very different information in them, so there had to be a modification, a syncing, of that information as well. And then of course there were the standard spelling and transcription errors.

We're going to talk a bit now about the exploration we did, which is very much tied in with the visualization, so you'll see a bit of overlap there. A lot of the exploration once again used open-source software, or nearly so (Tableau is not totally open source), along with R. We first ran some basic analytics, and what I want you to notice is that we had to do all of that cleaning and investigation just to get the data set to a place where we could say how many records we have, what the geographical coverage is, and what years are covered by our data. Here, if you remember, I talked about this thing called deferred manumissions: we were very puzzled when we saw manumission dates in 1870, when slavery had ended at the end of the Civil War, in 1864.
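One way to encode the "most viable year" decision Lori describes, with four possible date fields, many of them null, and deferred manumissions carrying a later effective date, is sketched below. The project's actual algorithm was written in R; this is a pandas paraphrase with hypothetical field names.

```python
# Sketch: resolve one usable year per record from several sparse date fields.
import pandas as pd

df = pd.read_csv("manumissions.csv")  # hypothetical file and columns
date_cols = ["date_recorded", "date_registered", "date_effective", "date_of_instrument"]

def best_year(row):
    """Collect every parseable, in-range year across the date fields and
    prefer the earliest; out-of-range values (like 1870+) are treated as
    suspect, since Maryland slavery spanned 1642-1864."""
    years = []
    for col in date_cols:
        ts = pd.to_datetime(row.get(col), errors="coerce")
        if pd.notna(ts) and 1642 <= ts.year <= 1864:
            years.append(ts.year)
    return min(years) if years else pd.NA

df["resolved_year"] = df.apply(best_year, axis=1)
```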
Trying to figure out what was happening there, once we discovered deferred manumissions, allowed us to go back and fix our algorithms so that we could adjust our years. We also looked at the age range of the persons being manumitted or granted certificates of freedom, and at the male/female composition. What jumps out immediately is that most of the certificates of freedom were granted to males. We can think of many reasons why that could be, including that head-of-household society in the 19th century was focused on the male in a household; but we also have to remember that enslaved people's households were frequently broken up, and quite a few enslaved females were retained for the purpose of continuing to produce children who would then be enslaved. So there are interesting factors that are corroborated, or come to light, as we look at this.

One of the things I'll point out, in terms of the statistics on the prior slide, is that we quickly saw that the information we had did not cover the entire state of Maryland. These are the various counties in Maryland. In the manumissions collection, we found that most of the records were from Anne Arundel County (and I'm very sorry, it looks like I've left Anne Arundel off this slide, but it has about 3,700 records) and Queen Anne's County, whereas we had very few records for some of the other counties. So we immediately realized that we didn't actually have information about the full state, and that in doing some of our analysis we might want to focus on just a few of those counties. Here's another nice little visualization we did, looking at the age distribution of manumission records by county. Once again, you get a sense of the age at which these manumissions were granted, some relatively earlier than others, with outliers in some situations; you see Anne Arundel, which had the largest set of records, as well. That immediately gives you a sense of how old people were when their freedom was granted, or at least promised to them.

We did a bit more visualization as well; I'm not going to show you all of it, just a few examples. When we visualized the certificates of freedom information in Tableau, this shows both the frequency (how many certificates of freedom were granted at a particular time) and the gender distribution. What you'll notice, with this darker pattern here (these are the females), is that the granting of certificates of freedom to females really didn't pick up until the mid-1830s into the 1840s, and obviously, as you saw before, in much smaller numbers. Then you have this interesting spike here; I'll come back to that. Here's another visualization we did, this time in R, using two different techniques; both are showing the same thing, the frequency distribution of manumissions documents by year, by county. This particular graph is a histogram, which gives you a little more sense of the shape of that frequency distribution: we see a peak that occurs, then suddenly a drop-off, and then an unsteady pattern. This is another way to visualize the same information; it doesn't convey the shape quite as well, but it gives you a sense, as you follow the dots, of where and when information occurs.
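As a rough Python analogue of the R frequency plots just described (the team's originals were in R), here is a sketch using the resolved years from the earlier snippet; the file and column names remain hypothetical.

```python
# Sketch: overlaid per-county histograms of manumission records by year.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("manumissions_clean.csv")  # hypothetical cleaned output

for county, group in df.groupby("county"):
    plt.hist(group["resolved_year"].dropna(), bins=range(1790, 1866, 5),
             alpha=0.5, label=county)

plt.xlabel("Year")
plt.ylabel("Manumission records")
plt.legend(title="County")
plt.show()
```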
One of the things we saw right away is that when we compared visualizations, even across data sets, we noticed patterns in the information. That interesting peak we saw in the certificates of freedom corresponds with this odd drop-off in the manumissions collection. What does that mean; what happened in 1831? We went back to our context, to our data biography, and we noticed two seminal events occurring in Maryland around that time. The first was the initiation of something called the Maryland Colonization Society, one of those movements to repatriate, to recolonize, enslaved Africans and take them back to Africa; as a matter of fact, quite a few manumissions in this period were designed specifically so that the formerly enslaved could be sent back to Africa. There was also a little bit of trouble in the state of Virginia, next door to Maryland: a gentleman by the name of Nat Turner led a rebellion of enslaved people in that state which was nearly successful and which greatly frightened slaveholders in Virginia, Maryland, and beyond. So at that point in time, Maryland moved to cut down on the ability of free Blacks to enter the state and also put something of a moratorium on manumissions for the time being, essentially using that as a prevention mechanism against possible revolts or rebellions.

As I mentioned earlier (and we're heading into the last part of our presentation), one of our key interests was seeing whether we would be able to link records across the data collections. This had its challenges, because of some of the issues I've raised before. We identified what might be reasonable data fields to use in our linking, and then, once again, as we looked at our data biography, one of the things we saw was that only a few of those fields would be available to us. We could link on geography, county, with a certain degree of certainty; we had already seen transcription errors with respect to county as well. Owner last name and first name were pretty steady; there might have been misspellings, but we had a certain degree of certainty there because of the practices at the time. We were not able to link an enslaved person's last name with a freed person's last name, because it was a common practice either not to give a last name to an enslaved person or, once an enslaved person had attained his or her freedom, for them to adopt a new last name in celebration of that freedom. So we had to focus on first names. And, I don't think we noted this, there was also age, which we originally thought we might be able to match on; but since birth dates were not recorded for enslaved people, that was not reliable either. Once we had those connections, we did some visualizations, and I'm going to ask Rajesh to talk about how we used graph database software to look at them.

Thanks again, Lori. In this couple of slides, we want to share with you how we used contextual matching to properly link the manumissions and certificates of freedom collections and represent them as graph, or network, visualizations. In the picture shown in the middle of the slide, each circle in the graph representation is a node: the blue circle shows one documented observation from the certificates of freedom, each manumission observation is shown as a red circle, and an orange circle shows each slave owner, a node collection we added for this purpose.
The relationships between the nodes are the lines that connect them, indicating, for example, that the slave owner at the top, named William Franklin, owned an enslaved person whose first name was Sukh and last name was Nal, just as Lori mentioned earlier; and similarly, another slave owner owned a different person with the same name, as per their individual certificates of freedom. Now, without using any contextual knowledge, when we tried to link the enslaved persons using only their first and last names between the certificates of freedom and the manumission documents, we saw that both of these enslaved persons in blue connect to one red circle, indicating a data-integrity error: the manumission documents did contain two different persons with the same name, but our matching criteria were not right. On the next slide: based on our data analysis and our understanding of the context of the data, and why last name and first name should not be the only matching criteria, we included other fields in the matching algorithm. As Lori indicated earlier, by including fields like county, owner's first name, owner's last name, and the enslaved person's first name, we were able to properly connect the nodes to their individual records from both the certificates of freedom and the manumission data. Now we see two connections from the blue circles going to the individual manumission records of the correct enslaved persons. We found this way of representing and visualizing digital archival data interesting, and we wanted to share it with you. Although the contextual analysis, identifying the key connecting fields, was manual, there is scope to automate and improve on that; that would be one of the many future steps to take this project forward, which Lori will explain. Back to you, Lori.

Thank you, Rajesh. So, folks, we're finishing up this part of our presentation; there are two more slides. Obviously, this is foundational work we can build on. As some next steps, we could take some of these visualizations, expand upon them, and create data dashboards: dashboards that are interactive in certain ways, using time-lapse, looking at interconnections between the collections, and also some nice static presentations of what we have available. I mentioned that we have these notes fields full of textual information, so we can use current data science techniques to begin to mine those and uncover additional individuals, additional events, and relationships that we can then visualize using the graph database technologies that have just been presented to you. And of course we're continuing to look at how we can enhance and potentially automate cross-collection connections, not just between manumissions and certificates but across the other collections available at the Maryland State Archives. One of the things we'd love to do is develop a set of case studies where we track specific individuals across those collections, from manumissions to certificates of freedom, through census collections and other collections, that then give us a sense of the life of an individual who previously may not have been visible in our records. And we want to look at whether there are ways we can extend the metadata attached to these databases, so that we can make better representations when we're using graph database technology.
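A small sketch of the contextual matching Rajesh described follows, using the networkx library. The DataFrames and column names are hypothetical; the point is the composite key of county plus owner name plus the enslaved person's first name, since last names and ages proved unreliable, as discussed above.

```python
# Sketch: link manumission and certificate-of-freedom records on a composite key.
import networkx as nx
import pandas as pd

manumissions = pd.read_csv("manumissions_clean.csv")   # hypothetical files
certificates = pd.read_csv("certificates_clean.csv")

def match_key(rec):
    # County + owner name + enslaved person's first name; last names and
    # ages are excluded because they proved unreliable for matching.
    return (rec["county"], rec["owner_last"], rec["owner_first"], rec["first_name"])

G = nx.Graph()
index = {}
for i, m in manumissions.iterrows():
    node = f"man:{i}"
    G.add_node(node, kind="manumission")
    index.setdefault(match_key(m), []).append(node)

for j, c in certificates.iterrows():
    node = f"cof:{j}"
    G.add_node(node, kind="certificate")
    for man_node in index.get(match_key(c), []):
        G.add_edge(node, man_node)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "candidate links")
```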
There are some suggested research topics here; these are not exhaustive. There is certainly quite a bit of work that can be done on ontologies for collections related to slavery. There is actually recent work here: an article came out in January, and the paper referenced at the bottom of the slide is coming out in August, from colleagues at the University of Michigan who have built an ontology related to the transatlantic slave trade. One of the things we see, though, is that there are some very specific elements within this particular set of collections, such as certificates of freedom, which are not necessarily part of the story of slavery elsewhere, their relationship with the manumissions, and also the legal construct, which invite us to look at other variations of ontology and at ways we can represent multiple perspectives of lived experience during that time, rather than being centered, perhaps, on the slave owner. How might we look at extending metadata for these and other similar collections, obviously in relationship to the ontologies, or retrofit it, to enhance the ability to access information across collections? And as we begin to use machine learning and other data science techniques to automate some of this discovery, what types of probability models would we need with this type of data? That ends my part of the presentation. I'm going to ask Richard and Jane whether they have some concluding words, and perhaps we can then squeeze in a bit of the Q&A; you've all been very prolific in your questions. Thank you.

Richard, we can't hear you. Can you hear me now? Yes. Sorry, I muted myself. I just have a few slides to talk about next steps, and then we can go back to you, so let me bring those up. Can you all see my screen? Yes, we do. Okay, thank you very much. Just a few additional references if you want to follow up on this. There's a really exciting, very important dimension we haven't touched on so far, but it's well represented in this body of work, and I want to give you a few references should it interest you, or should you want to work with us or follow up: it has to do with what some people are calling computational thinking. Before we do that, I have a URL for our next IEEE Big Data computational archival science workshop; it'll be the fifth year we've done this. I'm sorry, there's a typo there that I'll correct: it's not Los Angeles; this mid-December it will be in Atlanta, Georgia. I'll update that, and there's a call for proposals there, with a deadline sometime in October. So, there's a dimension which is computational thinking, and this has been a very useful framework for us, so let me go there. Indeed, we have a project funded by IMLS called "Developing a Framework for Mapping Computational Thinking into Library and Archival Science Education and Research," or CT-LASER. There's a reference here; we're about to conclude this project, which ran a little over a year, and you might find the final report interesting.
Essentially I have three bullets; these are some of the takeaway messages. We've extended this CAS work, and we are very clear and convinced that computational thinking is an important dimension that needs to be part of these investigations: certainly part of the training and learning space, in particular for MLIS programs, but also for training practitioners and professionals in the field. There's an effort right now, for those of you who might be interested, to develop what we're calling CT-enhanced (computational-thinking-enhanced) lesson plans that we can share, so that we can leverage our collective work and build on our insights and developments. I have another link to that: a framework where we are posting case studies, recorded as Jupyter notebooks, which are a way of telling computational stories and running them. It's an interesting format that's very popular in data science, and we're trying to bring it over to the archival space to create a framework that allows the recording and sharing of case studies, including the lesson plans we're about to work on next. And finally, we recently launched this network. This work, as eloquently demonstrated by the students, is highly interdisciplinary; you can't do this alone, you need teams and collaborations, so the intent of the AIC Collaboratory network is precisely to create a community of interest where we learn from one another. I really invite you to connect with us; we'd like to grow this space, and this is probably the only way to move the field forward.

I'm going to finish with a few additional references that are really solid in this space. One of our collaborators on this grant was David Weintrop, who established what he calls a taxonomy for computational thinking: essentially a set of 22 building blocks that fall into verticals. There are things that relate to data; things that relate to modeling (you can think of artificial intelligence and deep learning as modeling as well); things that relate to problem solving; and things that relate to systems thinking practices, for example how you think in levels or break down a complex system. There are several references here that really help with this. David Weintrop has done this mapping from CT to STEM; part of our work in our collaboratory is to go from CT to libraries and archives, hence CT-LASER. My colleague Bill Underwood, formerly of Georgia Tech, who is actually online right now, has a couple of really good references. At the fourth CAS workshop he took all the workshop papers and mapped them into this taxonomy, in terms of which archival science ideas were expressed and how they reference computational thinking topics; it's a very significant paper. There is also a second paper, the last one there, CT-LASER Practice, which takes these 22 building blocks, validates them one at a time, and relates them to well-known, published archival science case studies out in the field, a kind of two-way validation. These are very important papers. I'll end on this note, just to give you a little more fodder if you want to get into this: two papers that relate to our partnership with Densho.org, and Geoff Froh in particular, on PII in World War II Japanese American incarceration camp records, and a more recent paper from this past fall.
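As a purely illustrative aside, tagging a case-study step with building blocks from a Weintrop-style taxonomy might look something like the sketch below. The block names, verticals, and structure are hypothetical stand-ins, not the actual CT-LASER or CASES vocabulary.

```python
# Hypothetical sketch: describing one case-study step with a shared
# computational-thinking vocabulary so that case studies can be
# compared, linked, and contrasted consistently. Names are invented.
case_study_step = {
    "case_study": "Legacy of Slavery record linkage",
    "step": "Link certificates of freedom to manumission records",
    "ct_building_blocks": {
        "data_practices": ["collecting data", "analyzing data"],
        "modeling_practices": ["using computational models"],
        "problem_solving_practices": ["choosing effective tools"],
        "systems_thinking_practices": ["thinking in levels"],
    },
}

for vertical, blocks in case_study_step["ct_building_blocks"].items():
    print(vertical, "->", ", ".join(blocks))
```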
Both are published as part of our CAS workshop series; you can look at the slides and the papers there. These are from the third and the fourth workshops, and the fourth has five case studies with some 20 students. There we have connected all the case studies explicitly: they all reference the same set of 22 computational building blocks. So, to paraphrase my colleague Jane, what we're moving towards here is a metadata vocabulary, a sort of taxonomy to capture computational metadata, if you will, so that we can describe computational experiments in archival spaces in a coherent and consistent way, using a controlled vocabulary that allows us to relate these different case studies, to contrast them, to link them, and so on. That's what we're building into this CASES website infrastructure: a systematic way of thinking about computation and, increasingly, a systematic way of developing computational archival science projects. We think this is really significant, a major breakthrough in the four years I've been involved with this effort, and if you look at these two case studies I think you'll get a good sense of why I'm saying this is significant progress and something of a breakthrough in terms of how we can move forward. So I'm going to stop here, stop sharing my screen, and hand it back to the organizers. Thank you very much. Jane, I'm sorry I quoted you, but I was channeling you; you may want to jump in as well. I think it's good to turn it over; just thank you, and let's let the organizers carry on. So thank you.

Hi everyone, I'm Dr. Azure Stewart, one of the organizers along with Dr. Rebecca Baye. This concludes the presentation portion, so now we're going to address some of the questions you have put in the queue. If there are any additional questions you would like us to address, please make sure to add them in the Q&A section. First, one of the attendees mentioned that they are using manumissions in their work in Puerto Rico; that was more of a comment, thanking you for sharing your research. The next question I really want to get into is: what is the timeline for a project like this? How long have you been working with this data? Anyone on the team can answer.

Yes, let me take a stab at it. Our datathon was the culmination of eight weeks of data exploration in the DCIC, now the AIC collaboratory. It ran in September and October, and there were 17 students, undergraduate, master's, and doctoral, working on it. So, eight weeks of data exploration. Thank you.

The next question: one of the participants says it looks like CAS participation is focused on digitized material, and asks whether there is also experience with born-digital material, for instance appraisal. Although extremely interesting, it sounds more like computational historical science than archival science. I don't know who on the team wants to answer.

It's Jane. I just want to add to the last one and say that, in some respects, Sonya's answer was great, and that's the chunk of time she was able to give, but these projects are never finished in some ways, because there is so much we can learn once we get them computational: other ways to mine and look at the data. It's a forever-working project when you're dealing with research; there's never a "there there," just more angles and different additions to it.
So maybe I can jump in to answer that question. We had a project that ended about a year and a half ago, a collaboration with the National Center for Supercomputing Applications at the University of Illinois called Brown Dog, funded by the National Science Foundation. It utilized a lot of the same methodologies and techniques you're seeing today, and it was predominantly (not entirely, but predominantly) based on born-digital records. The really interesting conversations are those at the intersection with born-digital, which is absolutely essential. You saw the reference at the beginning to the new policies OMB and the National Archives are putting out, in particular that, moving forward, all records transmitted in feeds will be digitized, datafied, and born digital. That's the future we're moving into. I think we can learn a lot from these sorts of forensic deep dives, which is what you saw with the students, and an interesting research question is figuring out what you can do with born-digital versus datafied content, at what points the techniques intersect, and where they simply diverge. But we're definitely interested in the historical side as well; that's part of a number of our projects, including this National Science Foundation project. Thank you.

Next, many of you have been asking whether this will be recorded. Yes, it has been recorded; we will load it to the emerging technologies website and also post it to YouTube, so it will be available for those of you who want to reference it with colleagues or explore further. I'm going to dive into the next question: how much have you looked into collaborating with archival practitioners outside of cultural heritage organizations or academia? The attendee writes: I love the approach of building an educational framework and increasing scholarship, but archival science practitioners and archivists have been doing this kind of systems thinking in a wide variety of disciplines, including with born-digital collections and big data. I see a lot of discussion happening with data scientists and digital archive scholars, but not practicing archivists; we can learn a lot from each other.

I can jump in, and then somebody else can carry on. With the LEADS project, that is exactly the intent as we continue: to really reach out to people on the front lines in the field, in these institutions, not just in the academic or classroom setting, and to have people working on the front lines serve as mentors. In fact, in the original LEADS program, some of these folks on the front lines in archives and libraries served as mentors. We want to build stronger links, so that early-career people, and people out in the field who didn't necessarily have this training, can also form those links.
Richard, do you want to add anything? Sure. We really want to build bridges; I couldn't agree more with the comment. This cannot be an academic venture, and it is not, from our perspective. Not all of it is reflected in these slides, but I'd be happy to follow up: we have active collaborations with practitioners and professionals in the field. Those are the drivers; that's where our students end up. There is no dichotomy; it is all connected. With Mark Conrad I'm teaching a digital curation for information professionals certificate class right now, with professionals in the field. You have folks with decades of experience coming back to develop these skills and work with us; you also have folks who graduated just a few years ago coming back and sharing their expertise. They realize that in their institutions, which are not just cultural institutions (financial records, international organizations, business records, all over the map), these are really essential skills. And there is, admittedly, a point of friction here: I think iSchools and MLIS programs need to step up to this challenge. Not all the faculty are comfortable with this yet, which is normal; it's very collaborative, as we hope we demonstrated today; administrators aren't always supportive; and students tend to shy away from these things. We just need to provide more opportunities for learning and collaboration, and doing it with practitioners is absolutely essential. As a matter of fact, we have a few grant proposals in the works that are precisely about bringing together academics, practitioners, librarians, and archivists of various ilks. It's essential; I couldn't agree more.

Can I also add something? There are really a lot of takeaways from this project, and I learned a lot, but what I would particularly point out is that there is an existing field focused on digital data issues in the humanities, which we call digital humanities. Creating communities of practice that allow projects to be federated with each other is, as I said, what I took as a learning opportunity for everyone. As was mentioned, we can learn from each other; we really do need each other, working hand in hand on every project, because it doesn't only concern data; it also concerns humans and the humanities. That's it, thank you.

Thank you. Yes, a complex issue with multiple actors who need to be involved to accomplish the work and move forward. Exactly. I'll move on to the next question. One of the attendees asks: do you recommend starting a project with document digitization, or with data practices? This is open to whoever feels best placed to address it.

I can give the short answer, or do any of the fellows, or anybody, want to say anything first? I can offer my opinion afterwards. Okay, I will go for it. Hello, I'm Hanlin. I first came to computational archival science through Jane and the LEADS program, where I learned especially about how to deal with data, and then data was offered to us to do computational archival work.
From my personal experience, I think both ways are valid, and each has its own strengths and challenges, so this is just my personal preference. Document digitization is going to be somewhat more challenging: as we identified in this presentation, there will be transcription errors, and there are hard requirements around learning the context and background of your data and how to digitize it. On the other hand, with data practices, as in this project where the data offered to the students was provided by the Maryland State Archives, the students can focus more on the practice side, applying what they learned at school and what they have experience with. So I think it's somewhat easier for students to participate that way: they have a foundation to build on, instead of spending the eight weeks learning the rich context of how the documents came to be and the value behind them. But these are just my initial thoughts; does anyone else want to jump in?

I don't remember the exact specifics of the question, but my initial answer, thinking this through, is that it depends on what you want to get out of it: what is your learning objective, and what is the research question or service you want to provide? Either way, you keep building and increasing the foundational skills with every project; there is a learning curve. And I think this infrastructure we prototyped last year, which we hope to develop further, the CASES computational archival science case studies, will increasingly reflect that diversity: you will have born-digital data sets and a variety of provenances, so the whole range of interventions can be illustrated. We hope to package these as learning modules, so that, almost like going online these days and taking a class, you can pick and choose a point of entry into this space and decide how you want to learn and navigate based on your preferences and background.

Okay, I'll move on to the next question. One of the attendees asks: given the translation issues that pervade the relationship between archival context and computational datafication, i.e., losing contexts and creating new ones throughout the datafication process, what have you learned about archival practices and computational mechanisms, specifically with the case study presented today?

Let me start on this one, not only with this case study but also with my current project. I'm currently doing a project with the US National Institutes of Health, through the National Center for Advancing Translational Sciences, so this is a related answer to the question. The center is developing the Biomedical Data Translator, which aims to advance the development of high-need cures and reduce significant barriers between research discoveries and clinical trials. We are developing a comprehensive relational and dimensional biomedical data translator that integrates multiple types of existing data sources, and these include not only the objective signs and symptoms of disease and drug effects
but also intervention types and biological data relevant to understanding pathophysiology; we are also looking at archival material on pathophysiology. Archival practices and computational mechanisms are both helping us today in shaping the frameworks of our workflow. This project is the work of 18 clusters of different professionals: from the fields of archives and library and information science, from the medical field, and programmers. As I said earlier, and as others have also mentioned, these are really complex systems, and interdisciplinary, multidisciplinary work is necessary; we need to work hand in hand. Archives are essential: we are looking at them right now to understand the series of proteins we need to understand, in terms of reading the research that was done previously. So I hope that helps.

Thank you, Sonya. Anyone else have anything to add to this question as a follow-up? Then we move on to this last question. One of our participants wants to know: do we need practical training to use artificial intelligence in the archives?

I can try this one. I see two layers to this question. First, on artificial intelligence: it can be a very broad concept. We can have algorithms that help digitize slides and do OCR on real collections, where the handwriting differs by type of document; or we can have machine learning algorithms that try to do classification. Say you feed the classifier a thousand documents and ask the machine how many different types of documents are there; maybe the machine says there are three types: runaway slave ads, manumissions, and so on. There is a variety of things machine learning or AI can do, and it really depends on what you want to do, so the kinds of answers differ. I hope that clarifies the AI part a little.

Then the training part. In the domain of machine learning, people speak of "training" because one family of algorithms needs manually labeled documents. Basically, there are two types of algorithms: for some, such as clustering algorithms, you do not need to provide any training data; but for classification and prediction you usually need manually labeled data. For training, you take manually labeled data and feed it to the algorithm; the algorithm learns from what people have labeled and can then build on that to make predictions on new material. That is "training" from the machine learning perspective. There is also training in the sense of people learning how to use these algorithms, and there are online courses on machine learning and a lot of events.
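A toy sketch of that distinction, using scikit-learn, might look like the following; the documents and labels are invented placeholders, not real archival data.

```python
# Contrast of the two algorithm families described above:
# unsupervised clustering (no labels needed) versus supervised
# classification (learns from manually labeled examples).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

docs = [
    "reward offered for a person who ran away",       # runaway ad
    "notice of reward for runaway, last seen nearby", # runaway ad
    "deed of manumission setting a person free",      # manumission
    "manumission recorded, freedom granted by deed",  # manumission
]
labels = ["runaway_ad", "runaway_ad", "manumission", "manumission"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Unsupervised: KMeans groups the documents without any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("clusters:", clusters)

# Supervised: a classifier trained on labeled examples can then
# predict the type of a new, unseen document.
clf = MultinomialNB().fit(X, labels)
new_doc = vec.transform(["freedom granted by deed of manumission"])
print("prediction:", clf.predict(new_doc))
```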
And, as Dr. Richard Marciano is investigating here, doctoral students can develop these skills by working with archivists; we learn from each other in that kind of setting. All right, I think I answered that question a bit wordily; I hope someone can add to it.

I just want to add that I think the best projects we've had, and continue to have, are multidisciplinary. First of all, everything we try to do, to answer your question, is learning by doing; that's our philosophy. You can certainly read the theory and take classes, but in all our projects these last few years, and we've had over 300 students and practitioners embedded in these kinds of projects, we've always tried to put together interdisciplinary teams with at least one librarian, one archivist, and computer science folks. Those are the most interesting and the best projects. I think there is a human dimension to that: moving forward, all these kinds of digital projects will be highly interdisciplinary; the complexity is certainly beyond me, and beyond most of us, and you need teams. When you bring an archival perspective together with a technological perspective, people understand and develop the kind of confidence that is really needed: you don't have to be an IT expert or an AI expert to contribute to the conversation and add value from all kinds of perspectives, from the archival level, from an ethics standpoint, from a representational standpoint, from a data-loss standpoint. My colleague Linise is interested in erasure; she's a humanist who is embedded in a number of our projects. So I would think of these collaboratives more in those terms: you learn by proximity, you learn through diverse viewpoints, and you learn by doing. The most important thing, to me, is to jump in, engage, and create the kind of environment that allows you to experience those moments.

And to add to that: the goal of our datathon was to really understand and explore the conceptual and methodological challenges that have implications for the data. I attended a lot of practical training, and my point is that there was a lot of data cleaning and data analysis; the data cleaning was the input to whatever artificial intelligence work we may take up in the future, since that is among the future work we would like to pursue. What we benefited from in this datathon was a series of presentations around general questions: what were the approaches, the methodologies, and the decision-making around the archival data we worked on? What were our interesting findings? What obstacles did we encounter? And of course we were able to identify the opportunities we can look for. So the data cleaning we did was really essential and, as Hanlin said, becomes the input for the next step of artificial intelligence.

All right, for the sake of time, I want to say thank you to our participants; thank you for joining us. I also want to say thank you because this actually concludes the series I organized with
Dr. Azure on emerging technologies, big data, and archives. Regarding your last question: one of the first webinars we had as part of this project was actually on artificial intelligence and archives, so I would advise you to go to the website, where you will definitely find that information; moving forward, the recordings will be available there as well. We want to thank CLIR for sponsoring and hosting us; we also want to acknowledge the grant from the Mellon Foundation; we want to thank the Oklahoma State University Emerging Technologies and Creativity Research Lab for co-hosting this webinar; and we want to say a big thank-you to the Schomburg Center for Research in Black Culture, and also to New York University Library, for giving us the space and time to do this kind of work. With that being said, we want to say thank you; please, a round of applause for our panelists, and we hope you have a wonderful rest of the week. Thank you so much for joining us. Thank you all. Thanks, and thank you for putting this series together; it's been really interesting. Thank you, and thank you for joining us.