We're going to get started here. Our session today is on supporting computational research on large digital collections. We're going to cover a little program design, some framework recommendations, some lessons learned, and a tech demo, so hopefully a mix of things. I'm Jefferson Bailey, Director of Archiving and Data Services at the Internet Archive. Nick Ruest and I will be presenting the first part, and then our colleagues from the Library of Congress will present after us. So let's kick it off.

For this part of the talk, our outline: we'll discuss some general challenges we've learned over the course of supporting big computational research projects, talk about trying to turn those lessons into a text and data mining platform and service, do a walkthrough of the platform, and then talk about the associated work supporting specific research teams.

The challenge, if you're in this session, you probably already know: it's how to understand and address the technical, conceptual, and practical issues inherent in trying to support data-driven use of very large, generally born-digital collections, and we're talking generally at the petabyte scale. Having supported this work over many years at the Internet Archive, I'll try to summarize some of the technical, practical, and conceptual issues that have emerged in working with academic research institutes and scholarly teams.

There's a lot of technical complexity around the various formats we provide to these folks, especially, for us at IA, the web archive and the WARC format, how crawlers work, and how large-scale web collections are built and constructed. There are also things like codecs for A/V materials. These are not issues that scholarly researchers generally bring with them to these projects. We're generally talking about hundreds of terabytes, if not petabytes, of data that we are delivering to them. So how do you even get it to them, beyond shipping them boxes of hard drives? That introduces lots of challenges. Once they get this material, do they have the local processing capacity to deal with it? Some do and some don't, and some don't know they need it when they make the original request. If we're giving them data sets or derivations of these collections, there isn't necessarily great visibility into how the algorithm that created the data set worked. Sometimes we're running an algorithm for them and giving them the output; other times we're creating a data set that meets the need they're asking for. And there are, of course, derivation complexities when we filter and subset, run multiple jobs on data, and then give them the outputs.

There are conceptual issues. The provenance of these collections is generally well documented, but often documented in a technical manner, not necessarily in an intellectual-control manner. The material could have been donated to us, we could have archived it ourselves, or we could have acquired it through a service with a collaborating library. There are different methods of acquisition, and those are sometimes difficult to describe to researchers. There are border and boundary complexities: where does the data start and end? When did a web crawl start and end?
Why did it go and get these things? Why are there these omissions? And what's the breadth of the data set that we can build a research corpus from? And there are practical issues. How do researchers even come to us to ask for petabytes of data? At least for our institution, as an independent research library, we don't necessarily have great support in staffing or procedures for dealing with petabyte-scale requests. There are, of course, research agreements. Some of the data might be embargoed. It's generally all openly accessible public data, but how they can reshare it, how they can reconstitute it, and what they can do with it after it gets to them all raise questions. And of course there are budget, staff, and program complexities.

These are the models we've tried in getting people many terabytes or petabytes of data. One is just giving them the data in raw form, either over the internet or Internet2, or by shipping hard drives, or filling hard drives that they come and pick up from us in a truck. Another, the roll-your-own model, is putting the data on restricted-access hosted infrastructure in our own data centers, which we own and operate ourselves, so we keep high control over it, and giving them access there: they can run their own jobs, or they can write their own scripts, algorithms, and applications, give those to us, and we can run them if they can't, or if they don't have a high-performance computing cluster themselves. Middleware means finding software solutions that bridge the cyberinfrastructure and roll-your-own models, where there's some customization researchers can do even without direct access to or control over the data jobs. We also give people a lot of derived, extracted, or filtered data sets. These can generally be prepackaged before a researcher requests them, because a collection is of high interest or it's a format we know people use, like graph data or topic models; in some cases prepackaged data sets facilitate downstream research and scholarly use. And all of these models come with their own attendant support and community models: education, training, workshops, collaborative Jupyter notebook development, whatever the case might be.

So what are some of the practical lessons we've learned? Well, all large digital collections can be unkind to traditional methods of scholarly inquiry. This is certainly a big issue for us, in that we have web, digitized text, audio, visual, and live television archives. There are so many different formats, topics, curatorial methods, and acquisition strategies that if someone comes in looking for, say, the published works of a certain type of creator from a certain community over a certain time span, a data set built from that can cross hundreds of collections and many formats, and that can be very difficult to use because each has to be handled separately. So creating pre-existing data sets, subsets, extractions, and guides to help researchers understand that complexity has been important.
On the technical and conceptual issues I mentioned on previous slides: we have people who come and request multiple petabytes of data and do not have their own high-performance computing cluster to work with. They think they're going to work with it on their laptop; I have no idea how they think they're going to do that. Sometimes they just don't understand the scope and scale of the data that's available for a specific topic like disinformation or COVID. So helping them work through those technical issues up front, and then how that informs their research question and the corpus they build to answer it or run analysis jobs against, has been important. People always think that more data is better data, and that more data plus better data equals better research equals tenure, fame, and profit. That is definitely not the case, especially when you get into very broad disciplinary areas that can encompass, yes, petabytes of data that could be suitable for research. So helping them navigate from what's available, to what will be useful, to what they will actually be capable of working with has been important. And often they're aggregating the data we give them with data they're getting from other places, like journal content from their university's subscription agreements, or from Wikimedia, or from open digital collections. So working with them to understand how the data we're giving them might complement the larger corpus they're building for their research project has been important.

Here are just some examples. We've tried notebooks, APIs, interactive Kibana-style dashboards, and prepackaged data sets, so we've made attempts at that whole typology I talked about earlier, with varying degrees of success.

The project we're going to talk about, and then I'll wrap up my part, is the Archives Unleashed project, a long-running project focused on making web archives and historical internet content accessible to scholars. They were working at that effort at the same time that we at the Internet Archive were, and with generous support from the Mellon Foundation we've been combining our computational research support services, and especially our infrastructure, software, and tooling, into a joint project. So we have this project called ARCH, the Archives Research Compute Hub, with the goals of standardizing our tools and services and co-locating the data and the compute within IA-run data centers, which is essential for processing petabyte-scale information: having data and compute in the same place. We've also added much more directly supported and embedded scholarly research teams to work with us as we build out the platform, which is not something we'd really had the capacity to do ourselves at the Internet Archive. We've been working on this project since 2020, are about to release the platform, and have 20 to 30 libraries, archives, and museums involved in the pilot, as well as 50 scholars from around the world. So I'll pass it over to Nick to talk more about the ARCH platform.
Hey everybody. I'm going to talk about ARCH, the Archives Research Compute Hub, for a little bit, and then about the second part of the overall Mellon grant, which is working with our cohorts: a group of five different research teams doing research using the platform we built. I'm probably going to be kind of frenetic and wave my arms, so hopefully it makes sense.

ARCH, the platform itself, is an interactive web application. It's used by collection curators, people who subscribe to Archive-It, so it's set up for Archive-It users, but it's also set up so that you can just be a researcher who says, hey, I want to use these web archives and I want to be able to do this type of analysis. It's set up and ready for that, and we're actually using it that way with all of our research teams. As of right now, you can generate and download over 20 derivative data sets from a given web archive collection, and those also connect out to Google Colab for further analysis, depending on the size of the derivatives, because some derivatives are too big to work with in Colab. We did three rounds of UI and UX testing; we took great care to actually do UI and UX work on this, as opposed to the Cloud thing I made a few years ago. Some of the people in this room were part of that testing, and their feedback made the platform even better. In addition, there are in-browser visualizations for each of the data sets, and data previews so you can kind of see what you're going to get. And the most important thing compared to the previous version of the platform is that it lives at the Internet Archive, and I'm not moving 20 terabytes of data over the internet to a compute center at the University of Victoria.

If you're really interested in the actual stack, everything here is the foundation of it; I'll go over it really quickly. The icon on the left is Sparkling, which is one of two large Apache Spark libraries that are the underlying magic; the other is the Archives Unleashed Toolkit, which the Archives Unleashed project has been working on for a number of years. Apache Spark, for folks who don't know what that is, just allows distributed computing across all the cores on a single laptop, or, if you have 500 laptops all around the world, you can pool them together and do stuff (a toy sketch of that kind of Spark aggregation appears just before the walkthrough below). Scalatra is the web framework we're using; it's a small MVC framework written in Scala, instead of Sinatra, which would be the Ruby one. And then Hadoop pools all those hard drives so we can do analysis across petabytes. If you're really, really interested in the technology behind this, we have a JCDL paper that came out this year that goes in depth on it; if you look at the slides you can click out to it (this slide is not the paper itself). The source code is also public, so if you want to see the algorithms we wrote for each of the derivatives, you can get it at GitHub, internetarchive/arch. It's there, check it out. There's a Docker setup, so you can fire it up and run it on your own if you want. That's the paper I just mentioned, so I'm going to skip past this and just go in and show you what it looks like. If you're an Archive-It subscriber you might be familiar with this interface already; it lives inside Archive-It but is a little sidecar to it.
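Before the walkthrough, here's the toy sketch mentioned above: a hypothetical PySpark aggregation in the spirit of a derivative job, counting captures per domain from a CSV of extracted records. The file path and column name are made up for illustration and are not ARCH's actual code or schema.

```python
# Hypothetical sketch: a Spark aggregation in the spirit of a domain-frequency
# derivative job. Path and column names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("domain-frequency-sketch").getOrCreate()

# Assume a CSV of extracted capture records with a "domain" column.
records = spark.read.csv("captures.csv", header=True)

domain_freq = (
    records.groupBy("domain")                  # one group per domain
           .agg(F.count("*").alias("count"))   # number of captures per domain
           .orderBy(F.desc("count"))           # most-captured domains first
)

domain_freq.write.csv("domain-frequency", header=True)
```

The same logic runs unchanged on one laptop or across a cluster; that is the point of doing this work in Spark.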
So if you're logged in, you have your collections page, which lists all the collections available to your Archive-It account. This screenshot is just Ian's and my little test instance, so there are only three things here, but from this page you can click on any one of them. You get a little bit of metadata: the last time a job was run, the size of the overall collection, the job name, and whether the Archive-It collection is public or not. And when you go into the collection itself, you get another landing page that gives you an overview of the collection; we're grabbing some things from the various Internet Archive and Archive-It APIs: how many seeds are in the collection, when it was last crawled, and the overall size of the collection. This is really beneficial for non-Archive-It subscribers, researchers looking at it to get an overview. There's also a table listing recently completed jobs, and then you can jump over to create a new data set.

The data sets you can generate fall into four categories, and I'll go through them all really quickly. The collection category has two derivatives. The first is a dead-simple one, domain frequency: a two-column CSV of each domain and how many times it occurs, and that's it. And then there are the WAT files, an old ARS derivative; we ported the old ARS derivative programs over to run in ARCH as well. That's the collection category. The second category is network analysis, with four different jobs; these all produce CSV output except for the LGA files. There's a domain graph, a CSV of crawl date, source domain, target domain, and count; an image graph, similar to the domain graph, except you also get the image links and, where available, their alt text; and the web graph, which is one of the largest. You can take any of these CSV derivatives and drop them into something like Gephi and start doing further analysis, or load the CSVs into the network analysis framework of your choice, for example NetworkX if you're in Python.
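As a minimal sketch of that last step, here's how one might load a domain-graph-style CSV into NetworkX with pandas. The file name and column labels are assumptions for illustration, not the exact headers the ARCH derivative emits.

```python
# Sketch: loading a domain graph CSV into NetworkX for further analysis.
# File name and column labels are assumed for illustration.
import pandas as pd
import networkx as nx

edges = pd.read_csv("domain-graph.csv")  # columns assumed: crawl_date, source, target, count

# Build a directed graph weighted by link counts between domains.
graph = nx.from_pandas_edgelist(
    edges,
    source="source",
    target="target",
    edge_attr="count",
    create_using=nx.DiGraph,
)

# Example analysis: the ten most linked-to domains by weighted in-degree.
in_degree = graph.in_degree(weight="count")
top_domains = sorted(in_degree, key=lambda pair: pair[1], reverse=True)[:10]
print(top_domains)
```

The same CSV could equally be opened in Gephi for visual exploration; the CSV format is what makes the derivative portable across tools.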
The third category is text jobs, different ways of playing with the text. We have an NER job, so named entities you can generate; this one is in a JSON format, a somewhat bespoke format that the Internet Archive uses. Then we have two other jobs that are also CSVs. One is the plain text of each web page in a given archive: you get the crawl date, the last-modified date, the domain, the URL, the MIME type as provided by the web server and by Apache Tika, and the actual content with the HTML and HTTP headers removed. So if you're doing any type of text analysis, you have a big blob of text in a CSV and can do whatever you want with it. And if you want to study the HTML itself, or CSS or JavaScript frameworks, we have another job that leaves the HTTP headers and HTML in place and pulls out HTML files, plain text files, CSS files, JSON files, and so on into CSVs; there are about six sub-derivatives in that one you can generate, download, and play with. The final bucket is file formats: audio files, image files, PDFs, presentation files, spreadsheets, video, and Word documents, each as a CSV with a whole bunch of columns.

If you run one of these jobs, this is what the data set page looks like. You get an in-browser visualization (this is one of the network graphs), a little bit of metadata, a preview of what the CSV file looks like, and then the download button. This is the domain frequency job; you get a nice little bar chart as well, so you can see what the top ten looks like. And then, quickly, this is the new Colab integration I was talking about. This is a file-format job; if the data set fits within a certain size, you can jump over to Colab and start playing with it, and this is what that looks like. We have Colab notebooks, Jupyter notebooks, Python notebooks, for twelve of the different jobs; any of the CSV-based ones you can play with in there. It basically loads them into pandas and walks through what you can do, an example type of research. It's trying to bridge a gap we've seen come up with a lot of our research teams: hey, I've got these CSV files, what do I do with them next? This is what you can do with them; a minimal first step with pandas is sketched below.
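As a minimal illustration of that first step, here's a hypothetical pandas snippet loading a plain-text-style derivative CSV and doing a trivial bit of analysis. The file name and column labels are assumptions, not the exact headers the ARCH notebooks use.

```python
# Sketch: first steps with a plain-text derivative CSV in pandas.
# File name and column labels are assumed for illustration.
import pandas as pd

pages = pd.read_csv("plain-text.csv")  # columns assumed: crawl_date, domain, url, mime_type, content

# How many captured pages per domain?
print(pages["domain"].value_counts().head(10))

# Rough word counts per page, as a starting point for text analysis.
pages["word_count"] = pages["content"].fillna("").str.split().str.len()
print(pages[["url", "word_count"]].sort_values("word_count", ascending=False).head(10))
```

From there a researcher could move on to tokenization, topic modeling, or whatever method their question calls for; the notebooks walk through that kind of progression.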
If you look at it through the lens of a researcher, that's one way to use it; but through the lens of an archivist, another way to put it is that this is just a big giant finding aid, where you take everything that's in Archive-It plus the derivatives all together. We argue that case in this paper, if you're interested.

And then finally, I wish I had more time to talk about this, because it's one of my favorite parts of the project: we have two different cohort teams, whose purpose is to facilitate research with web archives. Each cohort is a year long; the funding is 10K US, which is about 11,500 Canadian; and we have bi-weekly calls, so we meet with each team every two weeks to talk through their research and whatever they need. It's really both sides mentoring each other: Ian and I have been working with this type of data forever, and a lot of it is new to them. If you're interested in their work, because I'm running way out of time now, these are the projects; they're really, really cool. This is the first group, which wrapped up in June of this year, and this is the second group. The first group gave presentations back in the spring, and all that video is on the Internet Archive; if you check out the video, you get a bonus of Quinn Dombrowski talking about the SUCHO project and things like that. So to wrap it up, I just want to thank our supporters; the Mellon Foundation made a lot of this happen, of course. And that's it.

Hi there, I'm Abby Potter. I'm with LC Labs at the Library of Congress, which is in the Office of the Chief Information Officer, our central IT service unit. Thanks for joining us. Meghan Ferriter, my colleague, also in LC Labs, will be talking right after me. I'm going to share some resources we've developed to help plan and assess AI, artificial intelligence, technologies in a library context, and specifically our experimental Labs context. Then Meghan is going to share about the Computing Cultural Heritage in the Cloud initiative, which is investigating how the Library might enable large-scale, data-intensive research with cloud services, and how we've used our AI planning framework to support that work.

Okay, so we really see AI as the next in a series of waves of technology change that libraries have had to ride: think microfilm, digitization, online accessibility and availability, search and discovery, digital preservation. With those technological waves, and now this AI wave, there are shared drivers and challenges bringing the wave to us. One is that we all want to stay relevant and current with user expectations: a lot of our users expect content to be digital, searchable, discoverable, and reusable; they want content that's relevant and connected to their specific task or goal; and they want very granular access. I think the kind of data research our colleagues shared before is a good example of that. And in our organizations and libraries we all have a lot of digital content, and the scale and variety of that content is going to continue to grow exponentially, but our staff size and technical capabilities are probably not going to grow in the same exponential way. AI and ML do seem to have the potential to help bring order to our unstructured, noisy, inconsistent data and help us meet some user needs.

But there are also challenges. Even though our collections are of high quality, are vast, and are produced to our standards, they still include historic biases; they represent an incomplete record; they're selected from larger collections and created in different contexts for various reasons. The level of description of these collections varies widely, and the resolution and quality of scans, audio, or images also varies widely. Sometimes they include offensive material, non-factual material, and errors. This is true of all of our collections, but the transformations our collections undergo as they go through a digital pipeline are really the product of decisions made about practices and the technology available at the time. We know some of that information, but sometimes our users don't. So we want to know: what would our users make of these transformations? What would an AI system make of these transformations? And then another big challenge is the lack of transparency around how AI
and ML systems are trained, how accurately they work with the content we hold, the mismatch between the claims vendors make about the systems they're selling us and the actual data backing up those claims, and the lack of transparency about how training data is created and used and how models are trained.

So with these shared challenges, we're thinking back to how we dealt with those other waves of technological change, and what we as a sector can do about it. I think it comes back to what we're good at: knowing our content, knowing our users, and relying on our ability to create detailed standards, community standards, that we can use when we work with vendors or with each other. We in Labs are not the only people thinking about this, or about how we can use AI technology in a way that works for our sector. This is just a quick overview of some of the other foundational work in this area, and you can see themes emerging from it. Recently the OSTP released the Blueprint for an AI Bill of Rights, centering the human experience in AI. It's not labeled in the middle here, but there's a trustworthy AI framework that NIST is working on, to be released in January, that maps out and emphasizes the level of governance and maintenance AI systems require to meet a trustworthy standard. There's also some foundational ethical work, and the IT Centers of Excellence AI capability maturity model, which gives a picture of how many elements of work are involved in having a mature AI system: it requires people, data, infrastructure, cloud, models, model security, and, moving from an individual project up to an enterprise level, it maps out the different stages of how you get there. I think Cliff mentioned before that maybe we've done enough experimentation around AI and machine learning, but it's hard to move up from the individual to the enterprise state without a shared idea and specific information about what we're doing.

This slide is not really meant to be read, but we've been experimenting in Labs with machine learning since about 2019: we've held events, written papers, structured work with contractors, worked with innovators in residence, and we have a lot of recommendations (this is not the complete list) and some next steps. The most recommended next step is to implement design principles and risk frameworks for AI and ML. So these are the resources we've developed in LC Labs, again based on our context, which is an experimental lab.

This is building out an organizational profile, which was recommended in the NIST trustworthy AI framework. We tried to think specifically about how AI would be used at the Library of Congress. Discovery at scale is one quadrant; this is where we've done most of our experimentation, generating metadata to increase users' discovery of our collections. Enabling research use is what we think of as preparing and transforming data sets so researchers may use them, like what was just
discussed before. And then these bottom two are areas we haven't really done a lot in: looking at different business cases, data processing, data management, digital preservation, different ways we might use this technology internally, where the users of the systems are mainly internal staff; and then augmenting user services, implementing things like chatbots or voice search over collections. So we're trying to think specifically about AI, not just about one giant solution or system we might want to consider.

Then, as I mentioned before, there's required vendor documentation. We've created what we call a data processing plan, and a version of it is required in a new contracting vehicle we just released, called the Digital Innovation IDIQ; if you're interested in that I can talk to you about it later. For any data transformation that involves AI or ML, vendors are required to fill out this fairly long form. It covers things like data labels and model cards, which are becoming more standard practice in trustworthy AI systems, but also includes things like data provenance and other domain-specific information we want to capture, so we have that information about the transformation, and other information about the data that is part of those transformations, which we as the stewards will probably have to collect and maintain as well.

Another tool we've been working on is a matrix for assessing benefits and risk. We're filling it out line by line, project by project. It contains a lot of basic project management questions you might work through anyway, but it also gets at some of the key issues around planning for AI: articulating who is going to touch these systems; articulating the risks and benefits for different user groups and recognizing they may differ (end users versus staff versus the organization); and really trying to document data readiness. That's a big blocker for using these systems with our data: we don't really have a lot of training data right now, so we're trying to document that and think it through. Like I said, we're looking at this project by project, but in the matrix we've also organized things according to the organizational profile, so over time we can see whether there are certain areas of the profile with more risks or more benefits, and think, okay, this is where we should prioritize our energy.

This is another piece, coming to the discussion of values and articulation of principles. In our crowdsourcing program, Meghan Ferriter led Concordia, the open source software tool that runs that program, or the technical part of it anyway, and she devised some design principles that help guide decision making and help articulate up front what we're trying to achieve at a broad level with these systems. And this is where we're trying to bring these things together. We've organized it this way, where the little houses mark where we've started and where we're currently active. You could start from
the bottom: the data processing plan, collecting data about the AI processes that are happening; then adding on the risk and benefit analysis matrix, building from the bottom up. You could also start from the top, where you identify shared statements of values. We've had some ongoing conversations with NARA, the Smithsonian, and Virginia Tech about whether and how we could, or should, develop some shared statements about how organizations like ours want to use AI, and then build out our own profiles and our own statements about how we want to use AI, with that as a starting point. This is our website, where you can find the code and the papers from our experiments, and we're starting to put some of these framework elements out there too. And we're hoping, and this is a call, not to be paralyzed by AI coming for us, but to do what we did with digitization and digital preservation: establish shared standards that we communicate to the world, which the world can then design to. That's our proposal. I'm going to hand it off to Meghan.

Thank you very much, Abby, for handing over, and Jefferson and Nick; I think there's a lot of coherence between these. We were looking at the broader scale there, thinking about implementing AI and ML in our context, but this particular project, Computing Cultural Heritage in the Cloud, is about starting where we can, and it was developed concurrently, in the same timeframe as our machine learning exploration over the last few years. This is a grant supported by the Mellon Foundation that began in 2019. These are some of the goals, and you'll see that we wanted to produce models and we were going to capture costs, but we didn't account for the knowledge exchange, the collaboration, and the convening that happened as part of this work, and those are some of the things we hope to share as we round out the grant. I also want to give thanks to some of our advisory board members who are here at CNI and have been really active in sharing their experiences with this type of work.

As we complete this grant, we also hope to document the use and capabilities of the cloud environment, and existing and possible library data processing and transfer workflows, which is a lot of words to say that, in our context, some of these things are more difficult than we may have anticipated, both in thinking about scale and in shifting from a model of serving smaller-scale research to broad research and the types of questions that can be asked of data, with appropriate documentation, scaffolding, and support from subject matter expertise at the Library. We're also hoping to think through staffing requirements, different types of feasibility around cloud services, and sharing findings more broadly to develop collaborative approaches in this space.

The project has been broken into six phases, and we're in the penultimate fifth phase, moving into a transition phase. These are some of the things we've undertaken so far. We had a bit of a pause during the early phase of the pandemic, then picked back up and brought on new staff; we've had staff details, we've collaborated with colleagues across the Library, and we've done a number of knowledge exchange
activities as well. We have assessed the readiness of our collections data to be used by these types of approaches, partly by engaging with scholars and computational researchers who were ready to ask questions of our collections; some of our forms of data were not ready to respond to those kinds of questions. We've also documented the capabilities and some of the limitations of our cloud services, because many of our cloud service models were not designed to support user needs in this way, though they could. We've supported ten expert researchers in two phases: a longer phase with three researchers, digital humanists and computational researchers Andromeda Yelton, Dr. Lincoln Mullen, and Dr. Lauren Tilton; and then, just recently, a data jam (I'll give you a snapshot in a moment) with seven additional researchers, asking them to look at what had changed from the way our data were presented at the beginning of CCHC to some very specific data packaging and availability in the cloud.

And just a quick note: we will preview our emerging service model recommendations as part of this grant. Primarily, some of the things we're thinking through are these. First, continuing to experiment around transforming data, which is very significant in our LC Labs context at the Library, given our positioning and the flexibility and the privilege we have to try new approaches and to ask people to engage with us and give us feedback. The second icon here, of people whose hands are meeting together at the table, is about continuing to support our users with staff roles and programs for self-service. We've heard this from a range of different models, including some of the things Jefferson shared and the experiences of our advisory board members, and we know it works well across other means of supporting digital scholarship and computational research; we also just had feedback from our data jam participants about the ways they would like to see knowledge exchange and other forms of self-service in accessing data and data packages. We will also continue to design and use cloud infrastructure, but think about it less as platform as a service or infrastructure as a service and perhaps more as data as a service: what would that mean, what are the implications, what are the staffing and other resource requirements, and what is the permanence or impermanence of data used in these particular ways? And then we will continue to develop staff competencies and enhanced collaboration strengths across our already extremely talented and generous colleagues and partners.

So, a quick look at some of our current focus in CCHC. We've been rounding out a large phase of information gathering and knowledge exchange with users of library collections as data, and thinking through the ways we might take some preliminary steps and connect back to some of the machine learning recommendations we have as well. We've been developing data packages in very specific ways, thinking through and applying the data processing plan that Abby showed; we've used it in developing our data packages, so we're documenting transformation across a broader set of collections, along with cover sheets, which are very similar to datasheets for datasets, if you're familiar with that concept. We've been hosting working groups
within our context, bringing together the shared challenges and needs across our organization. We've also established a cloud services sandbox for experimentation called data.labs.loc.gov. As of right now, the public-facing version includes three data packages that people can engage with, with data samples for bulk download and programmatic access to an S3 bucket (a minimal access sketch follows at the end of this part). This space is also allowing us to pursue other types of experimentation, and particular opportunities for staff members to be involved and to think about sharing the responsibility of managing data, and access to data, in the cloud.

We also continue to get user feedback. In our recent CCHC data jam, we brought together colleagues from across the Library to help us establish these data packages, and then shared an invitation to engage with the packages and provide real-time feedback. Seven experienced data wranglers accessed this library data through three different programmatic access pathways: using a software development kit, using the AWS REST API, and using the command line interface. We got immediate and excellent feedback that trying to access the data using the REST API in Amazon Web Services was the least effective for these researchers, so that gives us a starting point to continue the experimentation and build out these practices. This is a slide from Dr. Zoe LeBlanc's presentation; like many of our other data jam participants, she told us how much she wished she had more time to spend with the data sets. We only gave them a month to engage with the data packages and give us feedback, and we were able to solicit real-time feedback on the current technical setup, the ways people approach blending, combining, and interpreting different types of data, and opportunities to improve the documentation. We have just recently released some blog posts about this data jam, documenting these pathways and developing replicable data package methods, so we can share those with you; they're available on the Library of Congress Signal blog, where we describe our particular experimentation, the data pipeline we have available to us at the Library, and where new interventions need to be made within those pipelines. Here's a view into our data sandbox, data.labs.loc.gov, as I described.

I want to move quickly and briefly into what Abby described: some of the ways we've been carrying those machine learning recommendations forward into the work of CCHC, and also using CCHC to test some of those recommendations. One is that we have articulated values at each stage of this grant so far, and come back and touched on what it means to enact those values and how we carry that forward, both in communication and collaboration and in the practices of making data more accessible. We are also, just as the Archives Unleashed project is, sharing the outcomes of these different types of work in GitHub repositories; we're sharing the work of the researchers who contributed to the first phase of CCHC in GitHub, just as we do with our other LC Labs experiments, including code, documentation, and training data. And, as I mentioned before, in our data packaging we have followed these guidelines, used a data processing plan, and documented thoroughly along the way.
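As referenced above, here's a minimal sketch of the SDK pathway for that kind of programmatic S3 access, using boto3 with anonymous (unsigned) requests. The bucket and prefix names are placeholders, not the actual data.labs.loc.gov locations.

```python
# Sketch: anonymous S3 access to a public data package via the AWS SDK (boto3).
# Bucket and prefix names below are placeholders, not the real LC locations.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned config lets you read a public bucket without AWS credentials.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

bucket = "example-data-package-bucket"   # placeholder
prefix = "example-package/"              # placeholder

# List the objects in the data package.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one file locally for analysis.
s3.download_file(bucket, prefix + "sample.csv", "sample.csv")

# The command line pathway would look roughly like:
#   aws s3 ls s3://example-data-package-bucket/example-package/ --no-sign-request
#   aws s3 cp s3://example-data-package-bucket/example-package/sample.csv . --no-sign-request
```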
So we anticipate that in the end we'll have a large range of modular components as outcomes of this grant, and we'd really like to share them and engage with colleagues in different organizations approaching very similar challenges. One thing at the very end I'd like to point out: CCHC is a little bit amorphous as you're looking at it; it's a cloud, for sure, but as it passes we really see the artifacts and legacy of this work in the Library, including socializing these types of approaches and making real some of the components of implementing ML, machine learning. Along the way we've also improved some of our other infrastructure, including documentation of the loc.gov JSON and YAML API and other dimensions of our cloud service implementation. Thank you very much. We have just a couple of minutes for questions, and we'll also be very happy to talk more with you after. Are there any questions?

Well, I have one question, I guess, which is: is there anything you're not doing at LC? Because that was impressive. A question, actually, for Nick and Jefferson, which may feed into our LC colleagues: it looked like you were doing a lot of, I don't want to say traditional, but straightforward data mining routines. Did I track that right, or are you also doing any machine learning as part of what you described?

For us, a couple of cases, but not a whole lot. Most of our ML work is internal and not done in response to researcher demand, but we do have a couple of cases where we have done that, where researchers gave us training sets, algorithms, and models to run internally on collections when it was data we couldn't give to them directly. So a little bit, but not much. Yeah, and if you want to call NER machine learning, that's about the only AI or machine learning data set that's in the ARCH packages.

Hi, my question is for the LC folks: how in the world did you get through the bureaucracy and get permission to use AWS?

That's one of our current contracts for a cloud service provider. As Abby mentioned, we have about five years of LC Labs experimentation, and this is a place where we've already been able to share some smaller forms of data, a place where we had permissions and access already, and a place where we were able to coordinate with colleagues and provide resource justification for using that space.

I thought the data jam was a very innovative idea, and my question has to do with that data jam: is your final aim, in terms of the data, generating machine learning models or better algorithms for insight, or also actually generating insight from the data itself?

Thank you for the question. The purpose of this particular data jam was to generate insight on the technical access pathways we were prototyping, as well as the documentation: the components of the documentation, what might be missing for real users, the challenges they face every day. We heard a lot about the ways they'd like to connect with other data sources, and about the ways they felt subject matter expertise and connection to our colleagues would be really imperative. We do think that, as part of establishing the data.labs.loc.gov sandbox, this would be a great space for continuing to document and share machine learning training data, which is obviously a broad call across the community, and a place where we up to this
point have only been able to share via GitHub repositories and contextualize within those types of projects. So we're looking at this step by step: different types of users, what's useful for us, and then, as we work with vendors and continue to improve our acquisitions vehicles, taking the outcomes of that and making them useful for our own purposes and for a broader community.

Okay, so my earlier question was just a preamble. One of the challenges in this space is that everything is moving so quickly that by the time you finish a project you would never have used the technologies you started with, and I'm wondering if you can both opine about how you've built in that tolerance for change, or future proofing, or whether you don't care, and why not.

Well, I think these are human problems as much as they are technical problems, so having mechanisms for knowledge exchange with people who are practicing at the forefront, as well as people who are stewarding and staying connected to traditional practices, is a great spot to be in. Again, we're lucky to be situated in an experimental context that allows us to carve out a little bit of space for that. But I do agree; people ask us, are you using this method, or what do you think about that method, and my answer is that we're not the ones implementing this; we're creating space for other people to tell us what we need to do to be more useful or effective for those methods.

Yeah, for us, in response to your earlier question: I think us running ML and AI jobs for people on our collections is going to be way behind the curve of what we could do by brokering partnerships with the AI firms themselves, to get subsidized access to the APIs or whatever, and potentially share data and do it that way. So it's mostly through partnerships with the AI firms, which we're talking to, to try to figure out how that can be useful to libraries and the scholarly community, instead of trying to take their tools and run them ourselves, which is always going to be at least a couple of years behind.

Right. Yeah, and thinking about the organizational profile and the discovery-at-scale quadrant, I think it's not really cutting-edge technology we have to concern ourselves with there, so I think it's okay to be behind the curve as long as we're documenting what we're doing along the way.

I guess the only thing I have to add, if we're all going to add something: in terms of the portions of the project I've worked on, we've completely rewritten the underlying platform, I want to say, two or three times. It started off as Apache Pig when I first started on the project, in like 2016 or whatever, with Warcbase, and it's evolved, but it's the same basic underlying algorithms; it's just moving them into a different language or framework or whatever.

All right, I think we're over time anyway. Thank you. Thank you.